ABSTRACT


In this thesis a new type of information retrieval system is 
suggested which utilizes data of the type generated by the users of the 
system instead of data generated by indexers. 


The theoretical model on which the system is based consists of
three basic elements. The first element is a measure of the relatedness
between document-pairs. It is derived from information theory.
The second element is a definition of what constitutes a set (cluster)
of inter-related documents. This definition is based on the measure of
relatedness. The last element is a procedure which transforms a request
for information into a cluster of answer documents.


Requests are made by designating one or more documents to be of 
interest and perhaps some to be of no interest. The requestor can 
continue to interact with the procedure as it locates the answer cluster 
by specifying as interesting or not interesting other documents which
are presented to him. The answer cluster which is generated is auto- 
matically made as small (specific) or as large (general) as is desired, 
depending on the initial request and the subsequent interactions. 


An experimental system was developed to test the model in a 
realistic environment. It was programmed for the Project MAC time- 
sharing system and utilized the physics data file of the Technical 
Information Project. Citations were used as the data base for the 
measure of relatedness. A file structure and retrieval language were 
designed which allowed close man-machine coupling.


Experiments were conducted which compared the clusters of docu- 
ments produced by the experimental system with various sets of documents 
of known mutual pertinence. These sets included bibliographies from 
review articles, subject categories, and sets of documents found to be 
of interest to selected users of the system. It was found that between
60 and 90% of the documents of known pertinence were included in the
corresponding clusters. Ways of improving this retrieval efficiency 
even further are suggested. 


Thesis Supervisor: Robert M. Fano 
Title: Ford Professor of Engineering


PART ONE: INTRODUCTION


This thesis is divided into four parts. In 
this part we introduce the project by describing 
results of related work and by discussing the 
objectives of the research. In Part Two the 
theoretical model on which the project is based 
is presented. Part Three contains a description 
of the experimental system which was developed to 
test the model. In the final part we present the
experimental results and the conclusions about the
theoretical model that can be drawn from them.




CHAPTER I 


BACKGROUND 


1.1 Introduction 

In a pioneering article written at the close of World War II, Dr.
Vannevar Bush, Director of the Office of Scientific Research and
Development, called on scientists to redirect their energies to creating
"a new relationship between thinking man and the sum of our knowledge."
He noted that "our methods of transmitting and reviewing the results of
research are generations old and by now are totally inadequate."10

His challenge to mechanize and streamline the library process has 
been accepted by numerous groups in the intervening twenty years. A 
large number of devices have been developed which mechanically or 
electronically select information from a store. Methods of automatically 
indexing, classifying, and abstracting documents have been devised. A 
myriad of other disciplines have been called in for assistance. 

Before attempting to review and evaluate this activity, it is
extremely important that the implied "inadequacies" of traditional
library methods be clearly defined. Only then can one hope to determine
the effectiveness of any given approach in resolving these problems.


1.2 Areas Needing Improvement 
Six general aspects of library systems have been chosen as important
areas which need improvement and which appear to be amenable to
improvement through some type of mechanization. Most information
storage and retrieval projects have had as their stated or implied goals
one or more of these objectives.


1.21 Closer Man-System Coupling 


In many cases a user who comes to an information system cannot 
state precisely what he wants. He has a very real need for information, 
but he cannot define exactly what that need is verbally. In other
cases a user can accurately specify his interests but changes his mind 
as to what he wants when he finds that there are too many or too few 
articles which satisfy the request. 

Unfortunately most systems (automatic and manual) are designed for 
that rare individual who knows exactly what he wants and what the stack 
contains. In these systems there is a clear demarcation between request
specification by the user and answer presentation by the system. 

A much closer coupling of man and system is generally needed so 
that each can contribute to the best of his (its) ability at each step 
in the search. For example, the system might help the user in formulating 
the request by noting with each change in the request the probable number 
of documents in the final answer, by presenting representative documents 
for evaluation, and by ranking the output according to degree of related- 
ness. The user, on the other hand, could help the system find the desired 
answer by catching and correcting possible misunderstandings of the 
request as early in the search as possible, by narrowing or broadening 
the request if the size of the expected answer becomes too large or too 
small, and by continually refining the request based on the information
supplied by the system.


1.22 More Flexibility in Requests 
Even if it is assumed that a user can adequately specify his
interests, there is still the difficulty of matching his request
vocabulary with the vocabulary of the indexer. Perhaps the user is looking
for books on "information retrieval" but fails to realize that the
classifier posted such books under "documentation". Of course, the
classifier may have foreseen this difficulty and placed a "see" card
under information retrieval. However, this does not always occur.

Another basic problem is faced by the person who knows a given 
paper or a given author of interest but is forced to translate this 
knowledge into a set of descriptors instead of being able to feed it 
in directly as a request. 

More flexibility is needed in the allowable vocabulary, language
structure, and type of information which can be specified in a request.


1.23 Physical Barriers


The mere physical separation of the user from the library presents 
a barrier that has a greater impact than we may realize. This is also 
true of the separation of the card file from the stacks. Evidence of 
the importance of this factor is found in the popularity of small 
special collections distributed throughout a large organization and in 
the personal libraries maintained by most research workers. 

There is also the time barrier. If a person could get an answer to
his problem in five minutes, he might be interested, whereas he might
decide to bypass the problem if it takes one-half hour or more. A
third barrier is cost. This factor is not a direct consideration to the
user in most cases because no direct fee is levied for use of a library.


1.24 Quality of Selection Information

All libraries provide the user with certain types of information 
which help him to select from the total store those books which are of 
interest to him without having to scan the text of each book. Even 
those libraries which cater to the browser generally arrange books by 
content on the shelves and place the spine out so that the title and 
author can be seen at a glance. 

There are at least three important factors which must be considered 
in the generation of selection information for a given document. 

1. The actual contents of the document. 

2. The collection in which the document will reside. 

3. The needs and characteristics of the user population
serviced by the collection.

If the only factor to be considered in indexing were the contents 
of the document, then a valid method for indexing would be to have each 
author, as the final authority on what the document contains, index it. 
However, libraries have found that the other two factors are also 
important and that an author cannot be expected to be familiar with 
each library and each user population that might have his book or 
article. 

The approach used by conventional libraries is to rely on an 
indexer or classifier to generate the selection information needed. 
This type of individual is usually an expert on the contents of the 
library collection, but knows much less about the first and third 
factors. He usually has about 10-15 minutes' time to determine what
the author of the document has said and predict the types of users this
information will be of interest to (through the categories selected);
all this with little direct involvement in the field or area in question.
The amazing part about the whole process is that an indexer can
sometimes come up with a sketchy, but fairly useful portrayal of the
document.

An additional problem is that much of the literature (periodicals,
technical reports, etc.) never even receives the attention of an indexer.


1.25 Restrictive Classification Model 

Even if the classifier were able to determine the exact contents of 
a document, he would still find difficulty in fitting his findings into 
the rigid classification systems currently in use (Dewey Decimal, 
Library of Congress, etc.). 

First, the classifier is allowed only a yes-no type of response. 
Either the document is placed in a given category or it is not--there is 
no middle ground, no partial relationship. 

Next there is the "broken relationship" problem inherent in hier- 
archal classification structures. No matter where a category is placed 
in the hierarchy tree, there are related fields to which it cannot be 
adjacent. For example, if the history of physics is placed in the 
science area, it loses its connection to history and vice-versa. This 
problem is only partially alleviated by the "see" and "see also"
artifices. 

Third, there is the difficulty encountered in changing a classifica- 
tion structure to fit with our current body of knowledge. This involves 
considerable expansion and contraction of areas along with insertion of 
entirely new fields and the deletion of obsolete ones. The old
classification framework eventually becomes so strained in certain areas
that there is danger of collapse.

Each of these difficulties encountered in the classification of 
documents generates a corresponding difficulty for the user. V. Bush 
described the use of a classification system in this way. 


"information is found (when it is) by tracing it down
from subclass to subclass. It can be in only one place,
unless duplicates are used; one has to have rules as to which
path will locate it, and the rules are cumbersome. Having
found one item, moreover, one has to emerge from the system
and re-enter on a new path."10


1.26 Need for Dynamic Indexing 


Consideration of the problem of indexing leads one to the con- 
clusion that there is no intrinsic content to a document which, when 
once properly characterized by an appropriate set of words or phrases, 
is then adequately indexed for all situations and all users. In reality 
the depth and type of indexing needed depends both on the character- 
istics of the collection in which the document is imbedded and on the 
interests of the user population to be serviced by the collection at 
the time. 

Once this point is conceded then it becomes apparent that the way 
a document is indexed must change as the collection and user population 
vary. One of the major drawbacks of conventional indexing methods is 
that in practice they are static. A document, once indexed, is almost 
never re-indexed. Indeed some people believe that a properly indexed 
document should never need re-indexing. R. A. Fairthorne claims the
following--

"We have to assume that a classifier can decide that a
text is relevant to a topic in such a way that, apart from
blunders, neither future development nor decisions elsewhere
shall compel revision. Future developments certainly should
not upset any decision about relevance; if an item is relevant
to some topic, it will always be relevant, though the relevance
may become unimportant and new relevancies may be added."


The case for dynamic indexing was clearly presented by M. M.
Kessler:

"Indexing must be fluid and dynamic, reflecting the
changing needs of society and the contributions of new insights.
It is most unlikely that anybody, be he expert scientist or
expert indexer, can read a given paper at a given time and see
enough of its implications to classify it once and for all. If
this philosophy of classification were accepted, as it now is,
the resulting system would impose such a rigidity upon the flow
of information that the working scientist would be forced to
ignore it."26


1.3 Evaluation of Previous Efforts


It would be impossible to describe all of the work which has been
undertaken in the field of information retrieval and documentation in
the last 20 years. What will be attempted here is an analysis of
certain representative efforts in each of six broad areas.


1.31 Hardware Developments 


Many interesting machines have been developed for use in information
processing (Rapid Selector, Peekaboo, Zator, Walnut, Minicard,
general purpose computers, etc.). Instead of discussing the specific
capabilities of these machines, let us note some of the general trends
in hardware development which promise to have the greatest impact on
information retrieval.

The first would be the development of multiply-accessed
(time-sharing) computers. A research worker with a connection to such a
computer would be able to query a large central store of information 
directly from his office, laboratory, or home and receive an almost 
immediate response. This is in contrast to the batch-processing com- 
puter which processes requests in groups at a central location and 
usually involves delays in response of from several hours to several 
days. A brief description of a particular time-sharing system (the one 
used by this research project) can be found in Sec. 6.1. 

A system of users interacting with a large central information 
store through a time-shared computer offers another important capability 
that might be overlooked. Not only can the user obtain information 
from the system, but the system can also monitor the user. This moni- 
tored usage data could be collected at little or no inconvenience to 
the user. It would complete the information loop with feedback from 
the user continually modifying and improving system performance. 

Another significant hardware advancement is the development of 
larger and larger mass memories. It is estimated that all of the text- 
ual information in the 20 million documents in the Library of Congress 


could be stored in a 10 trillion-bit (10^13) memory. Current random
access devices store 10? - 10°° bits,while large magnetic tape install- 
ations have a capacity of 107+ bits. Random access storage devices have 
been announced in the iol bit range. It would appear that continued 
progress may soon eliminate storage capacity as a limiting factor in 


the mechanization of large information retrieval systems. 


A parameter closely related to memory size is access time. 
Typical access times to any part of a 07-bit file on a random access 
disc are currently 100 ms. The real problem is in knowing which part 
of the file to read. Perhaps associative memories, complete file 


inversion, or some other artifice will resolve this problem. 


1.32 Indexing Methods and Models 
As important as hardware developments are, V. Bush pointed out an 
even more basic problem. 


"The real heart of the matter of selection, however,
goes deeper than a lag in the adoption of mechanisms by
libraries, or a lack of development of devices for their
use. Our ineptitude in getting at the record is largely
caused by the artificiality of systems of indexing."10


The ‘systems of indexing’ to which Bush referred are, of course, 
the traditional subject catalog and classification schemes still in use 
(Universal Decimal, Library of Congress, etc.). Some of the drawbacks 
of these classification systems were discussed in Section 1.25.

Beginning about 1950 efforts were made to replace these conventional
classification methods. One result was "coordinate indexing." In
coordinate indexing documents are assigned Uniterms or descriptors 
(usually single words). These descriptors are given no hierarchal or 
other structure. A request consists of certain descriptors connected 
by the logical and-or-not operations. 

Coordinate indexing eliminated many of the difficulties encountered 
in hierarchal classifications and subject catalogs. However, its 
strength was also its shortcoming. The elimination of all order and
structure from the descriptors introduced many 'false drops'. For
example, a hypothetical user looking for papers on the causes of
blindness in Venice might also retrieve articles on the design of Venetian
blinds. To reintroduce that which was lost by eliminating descriptor 
context and order, such features as role indicators were used. 
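
The and-or-not request logic and the false-drop problem described above
can be sketched in Python. This is an illustration only: the document
names, the descriptor sets, and the function name `search` are
hypothetical and are not taken from any actual coordinate indexing system.

```python
# Hypothetical coordinate index: each document carries an unordered set of
# single-word descriptors with no hierarchy or context.
INDEX = {
    "blindness_in_venice": {"blind", "venice", "cause"},
    "venetian_blinds": {"blind", "venice", "design"},
    "venice_history": {"venice", "history"},
}


def search(index, all_of=(), any_of=(), none_of=()):
    """Return the sorted ids of documents that satisfy an and-or-not request."""
    hits = []
    for doc_id, terms in index.items():
        if not set(all_of) <= terms:
            continue  # AND: every required descriptor must be assigned
        if any_of and not set(any_of) & terms:
            continue  # OR: at least one of the alternatives must be assigned
        if set(none_of) & terms:
            continue  # NOT: no excluded descriptor may be assigned
        hits.append(doc_id)
    return sorted(hits)


# The request "blind AND venice" also retrieves the Venetian-blind paper.
print(search(INDEX, all_of=["blind", "venice"]))
# Excluding a descriptor removes the false drop in this toy file.
print(search(INDEX, all_of=["blind", "venice"], none_of=["design"]))
```

Because the descriptors carry no order or context, the two papers are
indistinguishable to the first request; the second request only succeeds
because the user happens to know which descriptor to exclude.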

Currently some workers in the field seem to be disenchanted with
coordinate indexing and have shifted reluctantly back to the conventional
classification methods.

Another field of endeavor was in the modeling area. A number of 
models were proposed which described the indexing and retrieval functions. 
Unfortunately that was all that these models did--they provided an
alternate way of describing an already familiar problem. No new insights
were gained and no helpful procedures resulted.


1.33 New Bases for Selection Information

It has already been noted that all library systems depend on 
selection information (classification categories, subject headings, 
author indexes, etc.) to locate documents relevant to a particular 
request. Customary library practice is to depend on the indexer to 
produce this information. Section 1.24 outlines some of the
difficulties inherent to this dependence.

Studies during the past eight years have been undertaken to see if 
selection information generated by indexers can be supplemented and per- 
haps replaced by that generated by the automatic processing of a docu- 
ment's contents. 

At first simple methods of exploiting the information found in a
document were tried. Permuted title indexes and citation indexes met
with some success. In 1958 Luhn proposed automatic abstracting.


This consisted of the selection of certain words as the keywords of a 
document based on their frequencies of occurrence. The sentences and/ 
or phrases which contained these words were then extracted to form the 
auto-abstract of the document. The idea was then extended by Maron in 
1961 to the automatic indexing of documents with the keywords extracted 
becoming the descriptors.
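
The frequency-based selection of keywords and the extraction of the
sentences containing them can be sketched as follows. This is a
simplified illustration under stated assumptions, not a reproduction of
Luhn's or Maron's procedures: the stop-word list, the scoring rule (count
of keyword occurrences per sentence), and the function names `keywords`
and `auto_abstract` are all hypothetical.

```python
import re
from collections import Counter

# Assumed list of common words to ignore; any real system would use a longer one.
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "are", "by", "on", "for", "it"}


def keywords(text, k=3):
    """Return the k most frequent non-stop words (ties broken alphabetically)."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    counts = Counter(words)
    return sorted(counts, key=lambda w: (-counts[w], w))[:k]


def auto_abstract(text, k=3, n_sentences=1):
    """Extract the n sentences containing the most keyword occurrences."""
    keys = set(keywords(text, k))
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def score(sentence):
        return sum(1 for w in re.findall(r"[a-z]+", sentence.lower()) if w in keys)

    ranked = sorted(range(len(sentences)), key=lambda i: (-score(sentences[i]), i))
    # Keep the chosen sentences in their original order.
    return [sentences[i] for i in sorted(ranked[:n_sentences])]


TEXT = ("Citations link documents. "
        "Citations between documents measure relatedness of documents. "
        "The weather was fine.")
print(keywords(TEXT, 2))
print(auto_abstract(TEXT, k=2, n_sentences=1))
```

In the automatic indexing variant, the list returned by `keywords` would
itself serve as the document's descriptors. Note that the sketch, like the
methods it imitates, ignores word order, context, syntax, and synonyms.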

Automatic indexing was about 50% successful in assigning documents
to the same categories that the human indexer did. This mediocre
showing can be attributed to the fact that machine indexing did not 
make use of the order, context, syntax and synonyms of the words 
extracted. This in essence is the same difficulty found in coordinate 
indexing. Some of the subsequent efforts at automatic indexing 
attempted to account for syntax, but this trail encountered the same 
massive obstacles that had already slowed progress in automatic language 
translation. 

Thus after some initial success, the automatic generation of 
selection information based on document contents ran aground. One 
cannot dispute the fact that a description of the subject covered by 
the article is contained within the article. Just how one can capitalize 
on that knowledge is the problem. The needed information is there, but 
machines and indexers currently can extract only a part of it. 

There is one notable exception to the above comments. The 
citations found in articles do not have the same type of synonym and 
syntax problems that textual material does. Thus selection information 
generated from citations has had considerable success for those bodies 
of literature which have a good citation base.


A discussion of the user of a library as a source of selection
information will be postponed until Chapter II, since little, if any,
prior experimental work has been done in this area.


1.34 Measures of Relevance

In conventional library systems documents are assigned to 
categories and subject headings on a yes-no sort of basis. Either the 
document is in the category or it is not--there is no middle ground. 
The restrictive nature of this type of arrangement was pointed out by 
Maron and Kuhns in 1960. They proposed that an 8-value weighted
indexing scheme be used to represent the degree to which a document is 
related to a term. 

This idea was extended to thesauri by Stiles in 1961.43 A traditional
thesaurus allows terms to be listed as synonyms or antonyms but
the degree of synonymity is left unspecified. Stiles proposed an 
association factor to represent the amount of synonymity between terms. 

Numerous other ‘measures of relevance’ between the various 
entities of libraries have been proposed since. Some of the better 
known of these measures are tabulated in Appendix A. Unfortunately, 
there appears to be considerable confusion over exactly what these 
measures represent, and the use of the term ‘relevance’ would seem to 
add to this confusion. 

Many documentalists now speak with some assurance about the amount
(to 3 or 4 significant figures) of 'relevance' of a document to a
category or to a request. The 'relevance ratio' is an accepted way to
measure information retrieval system efficiency. All too often these
comments leave one with the impression that there is some intrinsic
meaning to a word or document which has now been quantitatively described,
when in reality all that has been accomplished is the invention of some
type of frequency ratio.

In traditional library work confusion also appears to exist. Indeed 
the very idea of classification implies to some that there is some 
inherent content of a document which must be indexed. The already quoted 
comment by R. A. Fairthorne can be cited as an expression of the
attitude of some classifiers. 


"Future developments certainly should not upset any
decision about relevance; if an item is relevant to some
topic, it will always be relevant, though the relevance may
become unimportant and new relevancies may be added."

Let us suggest that the intrinsic meaning or concept behind a word 
is a philosophical problem and cannot be dealt with operationally. 

Those aspects of a document which do not influence its environment (i.e. 
the library and the user) are of no practical significance because they 
cannot be observed, measured, or even proved to exist. 

To avoid adding further to this misunderstanding we shall avoid the
use of the word 'relevance' in the rest of this paper. The frequency
ratios used by this project will be termed 'measures of relatedness'.
It is hoped that this term is less loaded with connotations of intrinsic
meaning.


1.35 Automatic Classification and Clumping Experiments 
After automatic indexing was proposed for the assignment of
documents to categories, it was only natural that the automatic
determination of the categories themselves should be tried also. This
was done initially by borrowing two techniques from mathematical
psychology--factor analysis and latent class analysis. Factor analysis
is used to discover the underlying factors which account for the
performance of a group of people on a battery of tests. Latent class
analysis is a procedure used to divide a group of people into disjoint
sub-groups on the basis of their responses to a questionnaire.

Latent class analysis for information retrieval has not yet been
experimentally tested. Borko's work with factor analysis was based
on the occurrence of keywords in document abstracts. A correlation
matrix of keywords versus keywords was formed and was factor analyzed,
resulting in categories which had some resemblance to those manually
selected for the same corpus.

An even earlier attempt at automatic classification was tried by
Needham and Parker-Rhodes in England. They called it clumping
and produced a heuristic procedure which selected clumps of documents
from a file. Their work has been extended in this country by Dale


and also by Sunneiee 
Since clumping is the endeavor most closely related to the objectives
of this project of any to date, a slightly more extended description
of the results will be given. A library collection is thought of as a
network with the nodes representing documents and values assigned to
the links (usually 0 or 1 only). This collection is partitioned into
two subsets, A and B. The sum of the links internal to A is denoted by
AA and the sum of the links internal to B is denoted by BB. The only
other links in the network are those which cross from set A to set B.
The sum of these links is designated AB.
A GR clump is defined as any set A which produces a local minimum
of the function F(A):

    F(A) = \frac{AB}{AA + BB}

A more recent type of clump, the D clump, is defined as any set A
which produces a local minimum of the function G(A):

    G(A) = \frac{AB}{\sqrt{(AA)(BB)}}

GR clumps are fairly easy to locate. Some additional restrictions 
must be placed on D clumps to make the definition useful since local 
minima of G(A) occur for quite unrelated sets of documents. The latest 
effort has been to find an initial set of items by some other method and 
then use the D-clump method to complete the set. 
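
As a hedged illustration, the link sums and the two clump functions can
be computed for a small network. The sketch below assumes the readings
F(A) = AB / (AA + BB) and G(A) = AB / sqrt(AA * BB), and it assumes that
"local minimum" means no single document can be moved across the
partition to lower the function; the six-document network and all names
in the code are hypothetical.

```python
from math import sqrt

# Hypothetical network: two tightly linked triples joined by one link (3-4).
DOCS = frozenset({1, 2, 3, 4, 5, 6})
LINKS = {frozenset(p): 1
         for p in [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]}


def link_sums(a):
    """Return (AA, BB, AB) for the partition into A and B = DOCS - A."""
    a = set(a)
    aa = bb = ab = 0
    for pair, value in LINKS.items():
        inside = len(pair & a)  # number of the link's endpoints lying in A
        if inside == 2:
            aa += value
        elif inside == 0:
            bb += value
        else:
            ab += value
    return aa, bb, ab


def F(a):
    aa, bb, ab = link_sums(a)
    return ab / (aa + bb) if aa + bb else float("inf")


def G(a):
    aa, bb, ab = link_sums(a)
    return ab / sqrt(aa * bb) if aa * bb else float("inf")


def is_local_minimum(func, a):
    """True if moving any single document across the partition does not lower func."""
    a = set(a)
    for doc in DOCS:
        neighbour = a ^ {doc}  # move doc to the other side
        if not neighbour or neighbour == set(DOCS):
            continue  # both A and B must stay non-empty
        if func(neighbour) < func(a):
            return False
    return True


print(link_sums({1, 2, 3}), F({1, 2, 3}), G({1, 2, 3}))
print(is_local_minimum(F, {1, 2, 3}), is_local_minimum(F, {1, 2}))
```

For A = {1, 2, 3} the sums are AA = 3, BB = 3, and AB = 1, so the set
qualifies as a clump under both functions in this toy network, while
A = {1, 2} does not, since moving document 3 into A lowers F.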

Both the automatic classification and the clumping experiments are 
designed so that all of the classifying and indexing would be completed 


before the requests are processed. 


1.36 Systems Evaluation 

The most widely accepted method of evaluating the performance of
information retrieval systems is currently through the recall and
relevance ratios. The recall ratio is the percentage of relevant
items that are actually retrieved and the relevance ratio is the
percentage of retrieved items that are relevant.
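
The two ratios can be written out directly from these definitions. The
following is a minimal sketch; the function names and the example sets of
document numbers are hypothetical.

```python
def recall_ratio(retrieved, relevant):
    """Percentage of the relevant items that were actually retrieved."""
    return 100.0 * len(retrieved & relevant) / len(relevant)


def relevance_ratio(retrieved, relevant):
    """Percentage of the retrieved items that are relevant."""
    return 100.0 * len(retrieved & relevant) / len(retrieved)


# Hypothetical example: five relevant documents, four retrieved, three in common.
relevant = {1, 2, 3, 4, 5}
retrieved = {3, 4, 5, 6}
print(recall_ratio(retrieved, relevant))     # 60.0
print(relevance_ratio(retrieved, relevant))  # 75.0
```

Both ratios depend entirely on who decides which items belong in the
`relevant` set.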

In determining what is or is not relevant, recourse is usually
made to an indexer or a user. Recent studies have shown that these
people are able to agree among themselves as to how documents should be
classified in at most 80% of the cases. This "failure" of humans to
index consistently has led some to try to find better automatic
"non-judgemental" standards on which to validate relevance.

If the primary objective of a library is in serving a given user
population, then it is difficult to imagine that there could be any
criteria for relevance other than one based on those users. If, on the
other hand, the function of a library is to set up a universal
classification system, then the user should certainly be eliminated as
the standard on which system efficiency is evaluated.

The idea that the users of a system can "fail" in classifying a
document implies an intrinsic content in documents which one or more of
the users has not recognized. A more practical outlook in keeping with
the arguments of Sec. 1.3 is that these differences in indexing are
only the normal result of individual backgrounds and interests.


CHAPTER II 


OBJECTIVE OF THIS PROJECT 


2.1 Brief Description of Project Objective 


Let us assume for a moment that we wish to design an information 
storage and retrieval system which is based on feedback from users. In 
this system each request for information is to consist of a set of one 
or more documents that the user has already found to be of interest and 
a second (possibly empty) set of documents that he knows are not of
interest. 

The purpose of each interaction of a user with the system is to 
transform a request of this type into a partitioning of the total collec- 
tion into two disjoint subsets--one containing all documents that are of 
interest to the user and the other containing those not of interest (the 
rest of the stack). This process is to be accomplished jointly by the 
user and the system. 

The feedback which the system stores for use in answering future 
requests is to consist of these file partitionings. A measure of the 
relatedness between any two documents based on their usage and co-usage 
patterns as found in the partitionings is to be utilized to facilitate 
the request-to-answer transformation. 
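
The raw usage and co-usage patterns mentioned above can be tallied from
stored partitionings as in the following sketch. It shows only the
counting step; the measure of relatedness itself is not defined in this
chapter and is not reproduced here. The function name, the document
labels, and the representation of a partitioning by its "of interest"
subset are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def usage_counts(partitionings):
    """Tally usage and co-usage from stored file partitionings.

    Each partitioning is represented by the set of documents one user
    found to be of interest; the rest of the file is implicitly the
    'not of interest' subset.
    """
    usage = Counter()     # number of partitionings in which a document was of interest
    co_usage = Counter()  # number of partitionings in which a pair was of interest together
    for interesting in partitionings:
        usage.update(interesting)
        for pair in combinations(sorted(interesting), 2):
            co_usage[pair] += 1
    return usage, co_usage


# Three hypothetical partitionings of a four-document file.
usage, co_usage = usage_counts([{"d1", "d2", "d3"}, {"d1", "d2"}, {"d3", "d4"}])
print(usage["d1"], usage["d4"])   # 2 1
print(co_usage[("d1", "d2")])     # 2
```

Any pairwise measure built on such data would be some function of these
counts and of the total number of partitionings.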

The document collection of such a system can be thought of as a
network where each node represents a document and each link is given a
value corresponding to the measure of relatedness between the two
linked documents.

The objective of this research endeavor is to devise, test, and
evaluate a procedure which will perform the transformation of request 
to answer partition for this type of retrieval system. 

In the above discussion we suggested for purposes of illustration 
a retrieval system based on file partitionings which are generated by 
the users of the system. Partitioning information of this sort would 
not be available for documents that have just been added to a file. 
Indeed, such information is not readily available for any file of docu- 
ments at the present time. 

There are, however, some types of partitionings which are available. 
Take, for example, the citations in an article. The author of an article 
selects for citation certain documents that he feels are pertinent to 
the article he has written. In a sense he is a special type of user of 
the library and has created a meaningful partition of the file. Other 
types of partitionings of the file could also be suggested. 

Usage information was selected for discussion here because it is 
an interesting and representative example of the larger class of parti- 
tioning information for which we propose to design a retrieval system. 

In the remainder of this chapter and in the next chapter we will, 
therefore, continue to talk in terms of the partitionings generated by 
users. It should be understood, however, that the type of retrieval 
system to be developed need not be restricted to this single type of 
partitioning data. 

In the next section we will present some arguments for and
against information retrieval based on usage information. We will then
discuss how usage information can best be represented and utilized.


2.2 Value of Usage Information

In the article already cited at the beginning of Chapter I, V. 
Bush suggested that an individual's personal information storage and 
selection system could be based on direct connections between documents 
instead of the usual connections between index terms and documents. 
These direct connections were to be stored in the form of trails through 
the literature. Then at any future time the individual himself or one 
of his friends could retrace this trail from document to document with- 
out the necessity of describing each document with a set of descriptors 
or tracing it down through a classification tree, 1° 

In 1956 R. M. Fano suggested that a similar approach might prove 
useful to a general library. He proposed that "the concomitant use of 
documents by experts as evidenced by library records, and other similar 
joint events" might be a useful basis for document retrieval.19?19 His 
proposal evoked a number of adverse comments, two of which will be quoted
here.


2.21 Objections 


A theoretical objection to basing retrieval on usage was raised by 
Y. Bar-Hillel. 


"A colleague of mine, a well-known expert on 
information theory, proposed recently, as a useful tool for 
literature search, the compiling of pair-lists of documents 
that are requested together by users of libraries. He even 
suggested, if I understood him rightly, that the frequency 
of such co-requests might conceivably serve as an indicator 
of the degree of relatedness of the topics treated in these 
documents. 

"I believe that this proposal should be treated 
with the greatest reserve. Although much less ambitious
than Taube's proposal of an association dictionary, it is in
many respects strikingly analogous to it and shares its
shortcomings. The fact that a co-requestedness chain of documents
can be easily followed up by a machine is not in itself a
sufficient reason for making the assumption that this relation
might be a useful approximation to the important relation of
dealing-with-related-topics between documents. And one can
think of many other easily establishable relationships between
documents that stand a better chance of being a useful
approximation, e.g. co-occurrence of their references in reference
lists printed at the end of many documents, co-quotation, and
so on."

The shortcoming of 'Taube's proposal' referred to in this quote is
the familiar triangle argument.

"Knowing that 'a' and 'b' co-occur...and that 'b' and 'c'
co-occur...what do we know about the connection between the
'ideas' 'a' and 'c'? Clearly, nothing definite whatsoever..."

What Bar-Hillel says is true also of hierarchical classification
systems where the adjacency of categories a and b and of categories b 
and c proves nothing about the relationship of a and c. It is true of 
any system consisting of a set of items and characteristics that cannot 
be described by some type of metric space. 

On the other hand the fact that documents a and c are not related 
in every case when linked through a third document b is more of a hypo- 
thetical objection than a practical one. If, in fact, items with the 
a-c type connection are found to be related on the average much more 
frequently than items chosen at random, then the usefulness of this type 
of connection in document selection should not be overlooked. 

A second objection to Fano's suggestion was raised by C. N. Mooers.
It is a practical rather than a theoretical objection.

"To provide feedback for improving machine performance
Fano and others have suggested the use of statistics of the
way which people use the library collection. Though the
suggestion points in the right direction, I think this kind
of feedback would be a rather erratic source of information
on equivalence classes, because people might borrow books on
Jack London and Albert Einstein at the same time. Although
this difficulty can be overcome, there is a more severe problem.
Any computation of the number of people entering a library and
the books borrowed per day, compared with the size of the
collection shows, I think, that the rate of accumulation of
such feedback information would be too slow for the library
machine to catch up to and get ahead of an expanding technology."

Mooers' objection assumes that the capability of accepting feedback 
from the user is to be superimposed on a conventional library structure 
and that it will have little net effect on the frequency of use of that 
library. Let us accept these assumptions for the moment and suggest 
some reasons why usage information would still prove profitable. 

First, libraries might well find it helpful to share usage patterns 
and thereby increase the total information available to any one library. 
Second, the well used documents will have plenty of usage statistics and 
be well 'indexed', while unused books will have no statistics--a
seemingly equitable arrangement. Third, even the information on one usage
of a document may prove more valuable than the information supplied by 
the indexer of that document. Fourth, usage information is not pur- 
ported to be a cure-all which will replace all of the current types of 
selection information. It is felt to be a supplemental source of 
selection clues which should grow in importance as more user feedback is
collected.

Now let us return to the initial assumptions and note that the
number of people who enter a library is by no means an indication of
the amount of time spent in the study of printed material. It is merely
an indictment of current library practices. If, in fact, information
were made available to research workers right in their offices through
the type of computer time-sharing system described in Section 1.31, then
the amount of feedback available from users should radically change.


2.22 Supporting Arguments 

Thus far in this section we have cited two early proposals that 
document selection be based on user feedback. We have quoted both a 
theoretical and a practical objection to such an approach and have 
attempted to answer these objections. Let us now turn to some of the 
positive arguments favoring user feedback which, to this author at least,
are compelling reasons why document retrieval should be based on
information from the user.

The first argument has already been alluded to in Section 1.26. 
In this section the need for dynamic indexing was observed. It was 
noted that it is impossible for an indexer to foresee all of the possible 
applications of a paper at any given point in that paper's history and 
especially not just after it is written. 

To account for the changing relationships and new applications of
papers in a collection, a library must be supplied with information.
Such information regarding the changing nature of the corpus must come
from the three participants in the library process--author, indexer,
and user.

To require indexers to periodically re-index the collection would
be financially impossible. Many libraries find it difficult to even
initially index each incoming document.

The textual information placed in the document by the authors
also offers little help. Take, for example, a research worker who
publishes a new discovery. A terminology which eventually evolves to 
describe that discovery may be markedly different from the language of 
the initial paper. And it would be a rather momentous task to develop 
a thesaurus which could connect the groping language of the basic paper 
with the codified terminology which eventually results. 

Thus, the user is left as the one participant in the library 
system who is continually interacting with the collection and could 
introduce dynamic indexing into the system. 

Let us note at this point that citation information in newly added
documents represents a specialized type of user information (the author
acting as a user of the old file), and as such can act in the same way
as usage information to give the system a changing indexing structure.
Some other advantages of this source of indexing information were noted
in Sec. 1.33.


The second argument in support of the utilization of user feedback 
concerns the quality of the indexing which results thereby. The advant- 
age of having the indexing done by people actually immersed in a given 
research area can hardly be overemphasized. Hitherto neglected refine- 
ments and distinctions can be made, the structure of the field as the 
actual worker sees it can be established, and many unintentional
blunders can be avoided.

It should be noted that the quality of indexing by usage is a
controllable parameter. Take, for example, the users of articles in
the Physical Review. This group of people represents a highly
knowledgeable and motivated segment of the population which should be able
to form valid links between documents. If, however, the quality of the 
resulting indexing is still insufficient, the system could be designed 
to accept feedback from only a segment of the population--say the faculty 
but not the students. This could even be made a parameter specifiable 
by the user so that he could use the feedback from that segment of the 
population which most closely fitted his own background. 

A third reason for indexing by user feedback is that it may be 
possible to do it as a by-product of normal library use and thus avoid, 
to some extent, the high cost of indexing which currently burdens a
library.


2.23 Collecting Usage Information

Let us now discuss the problem of how the intellectual decisions
needed from the user can best be obtained. The sets of citations found
in articles form one readily available source of sets of documents that
have been judged mutually pertinent. The data used by the experimental
portion of this project were taken from this source. (See Sec. 6.22.)

Let us consider for a moment whether a retrieval system could be
designed which was based on usage data of the type described in Sec. 2.1.
One major difficulty would be to devise some way of encouraging the
user to supply the system with the data needed. Some possible ways
this might be accomplished are the following:

1. The user finds that the system automatically disseminates to
him new articles of interest if he has provided profiles of
his interests in the form of sets of papers of known interest.

2. The user finds that in interacting with the retrieval program 
he converges on papers of interest more rapidly if he tells 
the system whether each paper presented is of interest or not. 

3. The user contributes sets of related papers to the system 
because he wishes to improve its usefulness to himself and 
others. 

4. Certain users are provided monetary remuneration for
supplying the system with sets of related documents.


2.3 The Purpose of Measures of Relatedness

The next question that arises after one has accepted the idea that
information selection might appropriately be based on some type of usage
data concerns the form in which this data should be expressed. One
might propose that each usage set be treated the same way as a subject
heading or descriptor set with its label being the name of the user
that generated the set. Under this scheme one might retrieve all of the
papers of interest to a given user or all of the papers which have been
found of mutual interest with a selected paper. Indeed the ability to
answer these types of questions is a valid capability to equip a
retrieval system with.

However, there are some significant differences between the sets of 
papers generated by users and the sets of papers generated by some type 
of indexing scheme. First, there is the fact that any given paper occurs
in, at most, only a handful of indexing categories, while it might
possibly occur in a very large number of user sets. Second, there can
be any number of user sets centering around a given area of research,
but this area would be normally covered by only one subject category. 
Third, usage sets would be continually added to the system, but new 
categories would be added infrequently. 

All this adds up to the fact that users who attempt to extract 
information from usage files with normal matching techniques will 
probably be overwhelmed with the non-uniform, massive, fluctuating 
nature of this type of data. 

Some type of statistical measure is needed which will combine and 
summarize the results of many user interactions. The specific
characteristics which this measure should have are discussed in Chapter III.

PART TWO: THEORETICAL DEVELOPMENT

The three chapters of this part describe the theoretical
model on which the research project is based. There are three
closely related components of the model.

Chapter III: Measure of Relatedness
Chapter IV: Cluster Definition
Chapter V: Search Procedure


The experimental system which was devised to test the
applicability of the model to a real world situation will be
described in Part Three. It is hoped that this organization
will help in keeping the abstract ideas of the model separate
from the particular physical implementation which was developed
to test it. It may be somewhat misleading, however. In
actuality the model was not completely developed before the
implementation began. It was continually revised and improved
as various versions of experimental systems were programmed,
tested and then discarded. What is described in this and the
next part is the current model and test program.

CHAPTER III

MEASURE OF RELATEDNESS


The first step in establishing the conceptual basis of the research
project is the selection of a measure of the relatedness between
documents. To this end a sample space will be defined and a probability
distribution assigned to it. Then a measure based on these
probabilities will be selected and some of its characteristics noted. Finally
the document network generated by the measure will be described.


3.1 Sample Space 

In order to motivate the choice of our mathematical model, we 
regard each interaction of a user with a library as a partitioning of 
the stack into two disjoint subsets of documents: one containing all 
the documents of interest to the user and the other containing the rest 
of the documents. Each interaction is assumed to have a single purpose 
in the sense that all documents of interest are of interest for the 
same purpose. 

There are theoretically $2^n$ such partitionings possible for a stack
of n documents. Now let us think of a discrete collection of $2^n$ points
(a sample space), each representing one of the possible partitionings.
These points can be identified by n-bit binary numbers, $x_1 \ldots x_n$, where
$x_i$ is 1 if the $i$th document is in the subset of interest and 0 if it is
in the subset of no interest for the partition in question. (A
superscript will be used to denote the value of a variable:
$x_i^1 \equiv (x_i = 1)$.)


For a given user population and document collection a probability
distribution $p(x_1 \ldots x_n)$ can be assigned to the sample space. Each
$p(x_1 \ldots x_n)$ may be regarded as the probability that a user chosen at
random from the population will partition the document collection with
the partition $x_1 \ldots x_n$.

Compound events can be defined in terms of the simple events
represented by the sample points. For example, $p(x_1^1)$, the probability that
document 1 will be of interest to some user, can be obtained by summing
the probabilities of all points for which $x_1 = 1$.

$$p(x_1^1) = \sum_{x_2 \ldots x_n} p(x_1^1 x_2 \ldots x_n)$$

Similarly $p(x_1^1 x_2^1)$, the probability that documents 1 and 2 will be
found to be of interest jointly, can be obtained by summing up the
probabilities of all points for which $x_1 = 1$ and $x_2 = 1$.

$$p(x_1^1 x_2^1) = \sum_{x_3 \ldots x_n} p(x_1^1 x_2^1 x_3 \ldots x_n)$$

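As a small illustration of these sums, the following Python sketch enumerates the sample space for a stack of three documents and obtains $p(x_1^1)$ and $p(x_1^1 x_2^1)$ by summing over sample points. The weights assigned to the points, and the names `points` and `marginal`, are assumptions made up for the example; they are not data from the project.

```python
from itertools import product

n = 3  # number of documents in the toy stack (assumed)

# Each sample point is an n-bit tuple (x_1, ..., x_n); there are 2**n of them.
points = list(product((0, 1), repeat=n))

# Hypothetical probability distribution over the 2**n partitionings.
weights = [8, 1, 1, 2, 1, 2, 0, 1]  # arbitrary illustrative weights
total = sum(weights)
p = {pt: w / total for pt, w in zip(points, weights)}


def marginal(p, interest):
    """Sum p over all sample points in which every document index in
    `interest` (0-based) has the value 1."""
    return sum(prob for pt, prob in p.items()
               if all(pt[i] == 1 for i in interest))


p_1 = marginal(p, [0])      # p(x_1^1): document 1 is of interest
p_12 = marginal(p, [0, 1])  # p(x_1^1 x_2^1): documents 1 and 2 jointly
```

For any realistic n the $2^n$ points cannot be enumerated in this way, which is why the sections that follow work only with estimates of the low-order probabilities.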
In the sections that follow we will want to talk not only about 
the abstract theoretical values of these probabilities, but also about 
their estimated values as obtained from experimental data. Suppose that 
there is information available on a large number of partitionings of a 
library. Let us make the following definitions. 


N: Total number of partitionings of the library that are
available.

$N_i$: Number of partitionings in which document i occurs in the
subset of interest.

$N_{ij}$: Number of partitionings in which both documents i and j
occur in the subset of interest.

Based on these N's, estimates of the probabilities can be made as
follows:

$$\hat{p}(x_i^1) = \frac{N_i}{N}, \qquad \hat{p}(x_i^1 x_j^1) = \frac{N_{ij}}{N}, \qquad \text{etc.}$$
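The counting behind these estimates can be sketched in Python as follows. Each partitioning is represented only by its subset of interest; the function name `count_partitionings` and the four sample partitionings are hypothetical and serve only to show the bookkeeping.

```python
from collections import Counter
from itertools import combinations


def count_partitionings(interest_sets):
    """Return (N, N_i, N_ij) for a list of subsets of interest.

    Each element of `interest_sets` is the set of document identifiers
    that one partitioning placed in its subset of interest.
    """
    n_total = len(interest_sets)
    n_i = Counter()
    n_ij = Counter()
    for docs in interest_sets:
        docs = sorted(set(docs))
        n_i.update(docs)                    # one occurrence per document
        n_ij.update(combinations(docs, 2))  # unordered pairs, keyed (i, j) with i < j
    return n_total, n_i, n_ij


# Hypothetical data: four partitionings of a small stack.
interest_sets = [{"a", "b"}, {"a", "b", "c"}, {"c"}, {"a"}]
N, N_i, N_ij = count_partitionings(interest_sets)

p_a = N_i["a"] / N            # estimate of p(x_a^1)
p_ab = N_ij[("a", "b")] / N   # estimate of p(x_a^1 x_b^1)
```

Only the univariate and bivariate counts are kept here, since those are the only ones the pairwise measure of the later sections requires.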

The partitioning data employed in these estimates may result from 
experimental evidence other than actual user interactions with the stack 
of documents in question. For instance, one might partition the stack 
on the basis of whether or not the documents cite a given document, or 
on the basis of whether or not they contain a particular word in their 
titles. As a matter of fact, the experimental system described in 
Chapter VI uses partitionings based on whether or not the documents cite 
a given document because these were readily available while actual usage 
data were not. 

This use of another type of partitioning data (other than usage 
data) by the experimental system is considered acceptable here since 
the purpose of the experimental portion of the project is to permit an 
investigation of general properties of the theoretical model that should 
be largely independent of the precise values of the probability
estimates.


3.2 Criteria for Selecting a Measure of Relatedness

We have already noted in Sec. 1.34 that a number of measures of
'relevance' have been suggested for use in information retrieval. Some
of the more widely known of these measures are tabulated in Appendix A.
The differences between them are partially due to the fact that they
were designed for different purposes and partially due to the varied
backgrounds of the people who proposed them. Some of them have a
theoretical basis in probability, statistics, or information theory; others
are of an ad hoc nature.

In Sec. 2.3 we discussed why a measure of relatedness was needed 
for this project. The purpose of such a measure is not to rate the 
individual or joint merit of the documents in the stack, but rather to 
represent their relationship in terms of frequency of use and co-use. 
To this end it was decided that the measure selected should have the 
seven characteristics listed below. 

Not all of the measures of Appendix A are expressible in terms of 
the theoretical probabilities of the last section. Therefore, for pur- 
poses of comparison we shall express these seven criteria in terms of 
the frequency counts on which the estimated probabilities are based. 
The N's are as defined in the last section, C is the measure of
relatedness between documents i and j, and $R \simeq S\,|_T$ means that R monotonically
increases with S as T is held constant.


1. Co-occurrence Factor: $C \simeq N_{ij}\,|_{N_i, N_j, N}$

The measure should monotonically increase with the number of
co-occurrences in the subset of interest of the documents in question if
all other factors are held constant. Consider, for example, a pair of
documents (i,j) and another pair (r,s). If the N's are the same for
both pairs except that $N_{ij} > N_{rs}$, then the relatedness between i and j
should be greater than the relatedness between r and s.

2. Other Usage Penalty Factor: $C \simeq (1/N_i)\,|_{N_{ij}, N_j, N}$

The measure should monotonically decrease as the number of
occurrences of one of the documents increases--all other factors being
held constant. That is, if document i is used a larger number of times
but not in conjunction with document j, then the relatedness between i
and j should decrease.


3. Co-occurrence Ratio Factor: $C \simeq (N_{ij}/N_i)\,|_{N_j, N}$

If the ratio or fraction of the number of co-occurrences of
document i with document j to the total occurrences of document i
increases, the measure should increase also. Note that this criterion is
not a consequence of 1 and 2.

4. Function of Probability Estimates Only: $C(N_i/N,\ N_j/N,\ N_{ij}/N)$

The measure should depend only on the ratios of frequency
counts which are used to estimate the probabilities. As long as these
ratios remain constant the measure should not change.

5. Statistical Independence

The one bench mark that is available for measures is the
statistical independence of the events in question. It would seem
logical that if the occurrences of two documents are statistically
independent, their measure of relatedness should have the value 0.

6. Theoretical Basis

A measure that has a solid theoretical basis is to be
preferred over one which has been developed by trial and error.

7. Ease of Use

The best measure is a simple one that is easy to calculate
and manipulate.


3.3 Selection of a Measure

Let us now evaluate the measures of Appendix A in terms of the
criteria of the last section. Measures (1) and (2) have no theoretical
basis (Criterion 6) and are not 0 for statistically independent events
(Criterion 5). The Chi Square Formula (5) is not expressible in terms
of the probability estimates (Criterion 4). The value of the Cosine
Formula (6) for statistically independent events is $\sqrt{p(x_i^1 x_j^1)}$, which is
neither 0 nor even constant. The Average Correlation Coefficient (7)
does not satisfy Criteria 1, 2, or 3.

This leaves Measures (3), (4), and (8), which meet (at least partially) all
of the criteria listed. Measure (8) was selected for this research
project because its foundation in information theory has led to some very
interesting and useful results.

The use of Measure (8) in document retrieval was first proposed by
R. M. Fano. In its more general form it expresses the degree to which
a set of events $x_1^1 \ldots x_r^1$ are correlated in terms of their individual
and joint probabilities.

$$C(x_1^1 \ldots x_r^1) = \log \frac{p(x_1^1 \ldots x_r^1)}{p(x_1^1) \cdots p(x_r^1)} \qquad (1)$$



The base of the logarithm function used in the formula and through- 
out the remainder of this paper will be assumed to be 2. This will mean 
that the unit of correlation will be the "bit". 

If only 2 events, i and j, are considered, then the coefficient is
equal to the mutual information, $I(x_i^1; x_j^1)$, between the 2 events as
defined in information theory:

$$C(x_i^1 x_j^1) = I(x_i^1; x_j^1) = \log \frac{p(x_i^1 x_j^1)}{p(x_i^1)\,p(x_j^1)} \qquad (2)$$



Let us relate the probabilities of formulae (1) and (2) to the
probabilities of document usage defined over the sample space of the
preceding section. The event $x_i^1$ is now the occurrence of document i in
a user's set of interest. The correlation $C(x_i^1 x_j^1)$ is the degree to
which the two documents, i and j, are taken to be mutually pertinent.
The approximation to C in terms of the estimated probabilities will
be denoted by the symbol $\hat{C}$.

$$\hat{C}(x_i^1 x_j^1) = \log \frac{\hat{p}(x_i^1 x_j^1)}{\hat{p}(x_i^1)\,\hat{p}(x_j^1)} = \log \frac{N\,N_{ij}}{N_i\,N_j} = \hat{I}(x_i^1; x_j^1)$$
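The estimate can be computed directly from the three frequency counts. The following Python sketch does so in bits (logarithm to the base 2); the function name `relatedness` and the sample counts are assumptions chosen for illustration.

```python
import math


def relatedness(n, n_i, n_j, n_ij):
    """Estimated relatedness in bits: log2(N * N_ij / (N_i * N_j)).

    Returns -inf when the two documents never co-occur (N_ij = 0).
    """
    if n_i <= 0 or n_j <= 0:
        raise ValueError("both documents must occur at least once")
    if n_ij == 0:
        return -math.inf
    return math.log2(n * n_ij / (n_i * n_j))


# Hypothetical counts with N = 16 partitionings.
independent = relatedness(16, 4, 4, 1)   # N_ij / N equals (N_i / N)(N_j / N)
more_co_use = relatedness(16, 4, 4, 2)   # one more co-occurrence (Criterion 1)
more_other_use = relatedness(16, 8, 4, 1)  # document i used more often alone (Criterion 2)
never_together = relatedness(16, 4, 4, 0)
```

With these counts the first value is 0 bits, as Criterion 5 requires for statistically independent events; the second is larger and the third is smaller, in line with Criteria 1 and 2.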


3.4 Practical Considerations

In order to calculate the measure of relatedness C for any
arbitrary set of documents selected from a collection of n documents, one
would have to estimate and perhaps store at least $2^n - 1$ probabilities.
This is, of course, out of the question for any reasonably-sized
document file. If C is to be used, some approximating simplification must
be made.

Let us now note that this correlation coefficient C can be expanded
in terms of mutual information terms as follows:

$$C(x_1^1 \ldots x_r^1) = \sum_{i<j} I(x_i^1; x_j^1) - \sum_{i<j<k} I(x_i^1; x_j^1; x_k^1) + \cdots$$

where

$$I(x_i^1; x_j^1; x_k^1) = \log \frac{p(x_i^1 x_j^1)\,p(x_i^1 x_k^1)\,p(x_j^1 x_k^1)}{p(x_i^1)\,p(x_j^1)\,p(x_k^1)\,p(x_i^1 x_j^1 x_k^1)}$$

etc.
It has been proposed that C be approximated by the first summation
in this series, and that the other summations be dropped as
higher-order effects. There are some theoretical reasons which would lead one
to believe that this would result in a good approximation to C [29].
However, we shall rest our case here on practical necessity and not go into
the details of these theoretical arguments.

$$C(x_1^1 \ldots x_r^1) \approx \sum_{i<j} I(x_i^1; x_j^1) = \sum_{i<j} \log \frac{p(x_i^1 x_j^1)}{p(x_i^1)\,p(x_j^1)}$$

For this approximation one need only estimate and store n univariate
and $\binom{n}{2}$ bivariate probabilities in order to obtain the correlation
between events and subsets of events.
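A minimal Python sketch of this first-order approximation follows, together with the storage count it implies. The function names and the counts for documents "a", "b", and "c" are hypothetical; the estimates $N_i/N$ and $N_{ij}/N$ stand in for the probabilities.

```python
import math
from itertools import combinations


def pair_information(n, n_i, n_j, n_ij):
    """Estimated pairwise mutual information in bits (-inf if N_ij = 0)."""
    return -math.inf if n_ij == 0 else math.log2(n * n_ij / (n_i * n_j))


def set_correlation(docs, n, n_i, n_ij):
    """First-order approximation to C: the sum of the pairwise terms
    over all unordered pairs of documents in `docs`."""
    total = 0.0
    for a, b in combinations(sorted(docs), 2):
        total += pair_information(n, n_i[a], n_i[b], n_ij.get((a, b), 0))
    return total


def stored_probabilities(n_docs):
    """n univariate plus C(n, 2) bivariate probabilities."""
    return n_docs + math.comb(n_docs, 2)


# Hypothetical counts for three documents, N = 16 partitionings.
N = 16
N_i = {"a": 4, "b": 4, "c": 8}
N_ij = {("a", "b"): 2, ("a", "c"): 2, ("b", "c"): 4}

c_abc = set_correlation(["a", "b", "c"], N, N_i, N_ij)
```

For a file of only 10 documents the approximation needs 55 stored values, against the 1023 joint probabilities required by the exact coefficient.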


Through the same approach one can obtain an approximation to the
correlation between any two subsets of events--

$$C[(x_1^1 x_2^1 \ldots); (y_1^1 y_2^1 \ldots)] \approx \sum_{i,j} I(x_i^1; y_j^1)$$

If these subsets overlap then one or more of the terms in the
series becomes the self correlation of the event.

$$C(x_i^1 x_i^1) = \log \frac{p(x_i^1 x_i^1)}{p(x_i^1)\,p(x_i^1)} = \log \frac{1}{p(x_i^1)}$$

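The two-subset approximation, including the self-correlation term that appears when the subsets overlap, can be sketched as below. The helper names `pair_term` and `subset_correlation` and the counts are assumptions for illustration only.

```python
import math


def pair_term(a, b, n, n_i, n_ij):
    """One term of the double sum, in bits."""
    if a == b:
        # Self correlation: log2(1 / p(x_a^1)), with p estimated by N_a / N.
        return math.log2(n / n_i[a])
    n_ab = n_ij.get((min(a, b), max(a, b)), 0)  # pairs are keyed in sorted order
    return -math.inf if n_ab == 0 else math.log2(n * n_ab / (n_i[a] * n_i[b]))


def subset_correlation(xs, ys, n, n_i, n_ij):
    """Approximate correlation between two subsets of documents as the
    sum of pairwise terms over every (x, y) combination."""
    return sum(pair_term(a, b, n, n_i, n_ij) for a in xs for b in ys)


# Hypothetical counts, N = 16 partitionings.
N = 16
N_i = {"a": 4, "b": 4, "c": 8}
N_ij = {("a", "b"): 2, ("a", "c"): 2, ("b", "c"): 4}

# The subsets overlap in document "b", so one term is a self correlation.
c_overlap = subset_correlation(["a", "b"], ["b", "c"], N, N_i, N_ij)
```

Here the four terms are 1, 0, 2, and 1 bits; the 2-bit term is the self correlation of document "b".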

3.5 Characteristics of the Measure for Document Pairs

The measure of relatedness is 0 for two statistically independent
events:

$$p(x_i^1 x_j^1) = p(x_i^1)\,p(x_j^1)$$

For events occurring together less often than if they were statistically
independent, C is negative and for events occurring together more often
C is positive.


Theoretically the range of C is from $-\infty$ to $+\infty$. However, there is
a statement that can be made about the upper bound. Since $p(x_i^1 x_j^1)$ cannot
be larger than $p(x_i^1)$ or $p(x_j^1)$ the following inequalities hold:

$$C(x_i^1 x_j^1) = \log \frac{p(x_i^1 x_j^1)}{p(x_i^1)\,p(x_j^1)} \le \log \frac{1}{p(x_i^1)}, \qquad C(x_i^1 x_j^1) \le \log \frac{1}{p(x_j^1)}$$

The quantity $\log[1/p(x_i^1)]$ is termed the self information of $x_i^1$ in
information theory. Thus, the correlation between two events is always
less than or equal to the self information of either event. Let us
indicate this range on the simple graph of Fig. 3.1.

[Figure: axis of C with the upper limit marked at $\max[\log(1/p(x_i^1))]$.]

Fig. 3.1. Range of measure of relatedness.


Some additional comments about the range of the measure can be made
if we consider $\hat{C}$, the approximation to C based on the estimated
probabilities. The maximum positive value of $\hat{C}$ is (log N) and occurs when
$N_{ij}$, $N_i$, and $N_j$ all equal 1. Its minimum value other than $-\infty$ is (2 - log N)
and occurs when $N_{ij}$ is 1 and $N_i$ and $N_j$ are N/2. This range is shown in
Fig. 3.2.

[Figure: axis of $\hat{C}$ marked at $-\infty$, 2 - log N, and log N.]

Fig. 3.2. Range of approximation to measure of relatedness.
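The two end points quoted above can be checked by evaluating the estimate at the count configurations named in the text. The sketch below does this for a hypothetical file of N = 1024 partitionings; the function name and the value of N are assumptions.

```python
import math


def relatedness(n, n_i, n_j, n_ij):
    """Estimated relatedness in bits; -inf when N_ij = 0."""
    if n_ij == 0:
        return -math.inf
    return math.log2(n * n_ij / (n_i * n_j))


N = 1024  # hypothetical number of partitionings (a power of 2 keeps the logs exact)

upper = relatedness(N, 1, 1, 1)            # N_ij = N_i = N_j = 1
lower = relatedness(N, N // 2, N // 2, 1)  # N_ij = 1, N_i = N_j = N/2
```

With log N = 10 bits, the two configurations give 10 bits and 2 - 10 = -8 bits respectively.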


For the test data utilized in the experimental portion of this
project (see Sec. 6.1) it was found that the $\hat{C}$'s were either $-\infty$ or had
some positive value (see Fig. 3.3). The lower limit of (2 - log N) in
Fig. 3.2 is changed in Fig. 3.3 since all of the $N_i$'s of the test data
are much less than N/2. The new minimum of $\hat{C}$ occurs when $N_{ij} = 1$ and $N_i$
and $N_j$ are maximum (called $(N_i)_{max}$).

[Figure: axis of $\hat{C}$ marked at $-\infty$, $\log[N/(N_i)_{max}^2]$, and log N.]

Fig. 3.3. Range of measure of relatedness for test data.


The range for the test data is due not so much to the fact that the
occurrences of the documents in the test file are never statistically
independent as to the fact that such statistical independence can only
be detected with a very large data base. Consider documents i and j
with $p(x_i^1) = p(x_j^1) = 0.0001$. If $x_i^1$ and $x_j^1$ are statistically independent,
then $p(x_i^1 x_j^1) = 10^{-8}$. In order for any of the probability estimates to be
this small we would need at least $10^8$ partitionings. Many, many more
partitionings than this would be needed if one wanted to have accurate
estimates of the occurrences of such rare events. With fewer
partitionings these events either never occur, resulting in $\hat{p}(x_i^1 x_j^1) = 0$, or do occur
with the estimate for $\hat{p}(x_i^1 x_j^1)$ being larger than it should be. This is
the phenomenon observed for the test data. Even if there were
correlations that were 0 or slightly negative they would be pushed to $-\infty$ or to
some positive value because of the limited number of partitionings
available.
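The arithmetic of this argument can be traced with a few lines of Python. The probabilities are those used in the text; the smaller file size of $10^5$ partitionings is a hypothetical figure introduced only to show what happens when the data base is too small.

```python
import math

p_i = p_j = 1e-4     # probabilities used in the text
p_joint = p_i * p_j  # 1e-8 under statistical independence

# The smallest nonzero estimate obtainable from N partitionings is 1/N,
# so N must reach 1/p_joint before an estimate can be as small as p_joint.
n_needed = round(1 / p_joint)

# Hypothetical smaller file: N = 10**5 partitionings, so N_i = N_j = 10.
N, N_i, N_j = 10**5, 10, 10
c_one = math.log2(N * 1 / (N_i * N_j))  # a single chance co-occurrence, N_ij = 1
c_zero = -math.inf                      # no co-occurrence, N_ij = 0
```

In the smaller file the only values available to the estimate for this pair are $-\infty$ and log 1000 (just under 10 bits) or more; no count can produce the value 0 that statistical independence calls for.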

It is conjectured that this will be the situation in most practical
cases for some time to come. In a very large document collection
(10°10 items) the probability of occurrence of any one document is
probably small, say $10^{-3}$ or $10^{-4}$. This would require a file of $10^6$ to
$10^8$ partitionings to measure statistical independence which would take
considerable time and effort to collect. In a small document collection
the probability of occurrence of any one document could be larger but the
number of partitionings available would undoubtedly be less also.

It should be pointed out that this measure will assume some value
for every pair of documents in the stack (except perhaps documents that
have never been used). Even two documents that have never co-occurred
($N_{ij} = 0$) are related by the value $-\infty$.

A few comments should be made about the value $-\infty$. It is not a
realistic value for the correlation between most documents because it
implies that there is absolutely no chance of two documents co-occurring.
As has already been pointed out this arises because the probabilities may
end up exactly zero. A much more practical and reasonable approach to
the problem would be to make all correlations between document pairs for
which $N_{ij} = 0$ equal to some finite negative value instead of $-\infty$. More
will be said on the choice of this negative value (K) later (Sec. 4.5).


[Figure: axis of $\hat{C}$ marked at K, $\log[N/(N_i)_{max}^2]$, and log N.]

Fig. 3.4. Revised range of measure for test data.
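The substitution of K for $-\infty$ amounts to a one-line change in the calculation of the estimate, as the following Python sketch suggests. The function name and the value K = -4 are purely hypothetical; the choice of K is the subject of Sec. 4.5.

```python
import math


def relatedness_with_floor(n, n_i, n_j, n_ij, k):
    """Estimated relatedness in bits, with the finite negative value K
    substituted for -infinity when the documents never co-occur."""
    if not (math.isfinite(k) and k < 0):
        raise ValueError("K is meant to be a finite negative value")
    if n_ij == 0:
        return k
    return math.log2(n * n_ij / (n_i * n_j))


K = -4.0  # hypothetical value, for illustration only

never_together = relatedness_with_floor(16, 4, 4, 0, K)
co_used = relatedness_with_floor(16, 4, 4, 2, K)
```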

Another feature of the selected measure is that it is non-directional.
That is, the value of the measure from document i to j is the same as
from j to i.


3.6 Document Networks

It has been suggested that measures of the relatedness between
documents should be metrics. This would require that a measure C exhibit
the following properties:

(1) $C(x,x) = 0$
(2) $C(x,y) > 0$ (if $x \neq y$)
(3) $C(x,y) = C(y,x)$
(4) $C(x,y) + C(y,z) \ge C(x,z)$

The measure under consideration does meet property (3). It might
conceivably be made to fit properties (1) and (2) through some type of
normalization or restriction. There appears to be no way to make it
have property (4), the triangle inequality. Indeed, it would be rather
disturbing to this author if it did have property (4).

Bar-Hillel has pointed out in the comment cited in Sec. 2.21 that
many of the important aspects of a document collection (except physical
location) cannot be made to satisfy the triangle inequality and cannot,
therefore, be represented by metrics. His conclusion was that measures
derived from these features (joint usage, common citation, etc.) are
useless. Our conclusion is that such measures should not be required to be
metrics. 

The idea that a metric space is the appropriate model for a docu- 
ment collection is rejected here. If one desires a model to aid in his 
mental picture of a document collection, a simple network is suggested. 
Each document can be considered a node and the link between two nodes 
can be assigned the value of the measure of relatedness between the 
corresponding documents. It has already been pointed out that the 
measure of relatedness chosen links every node (document) to every other 
node. It might, therefore, be easier to visualize the sub-network con- 
sisting of only positive links. This is the visual picture found most 
helpful to the author. 
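The sub-network of positive links can be pictured concretely with a small Python sketch. The link values below and the function name `positive_subnetwork` are hypothetical; any table of pairwise values of the measure could be substituted.

```python
def positive_subnetwork(links):
    """Keep only links with a positive value and return an adjacency map
    {document: {neighbor: value}}. `links` maps unordered pairs to values."""
    adjacency = {}
    for (a, b), value in links.items():
        if value > 0:
            adjacency.setdefault(a, {})[b] = value
            adjacency.setdefault(b, {})[a] = value  # the measure is non-directional
    return adjacency


# Hypothetical link values in bits for four documents.
links = {
    ("a", "b"): 1.0,
    ("a", "c"): 0.0,
    ("b", "c"): 1.0,
    ("a", "d"): float("-inf"),
}
net = positive_subnetwork(links)
```

In this example document "d" drops out of the picture entirely, and documents "a" and "c" are connected only through "b".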

Thus far we have considered the problem of generating a document 
network from a set of probabilities. Let us now consider the reverse 
process. If one draws a document network and arbitrarily chooses the
values to be assigned to the links, can a set of probabilities be found
which could have generated the network? This question is of interest
because if there is only a certain class of networks that are realizable
from sets of probabilities, then we need focus our attention only on that
class.

Theorem. For every document network (with the restriction 

that the values of the positive links be finite) there is at least 

one set of probabilities which could have generated it. 

Proof. The first step in proving this theorem will be to select a
set of values for the elementary probabilities, $p(x_1 \ldots x_n)$. It will
then be shown that the set selected yields the correct values for the links
of the network in question and forms a valid set of probabilities (i.e.
each value is in the range 0 to 1 and their sum is 1).

Before proceeding let us define the following symbols.

n: number of documents in the network ($n \ge 2$).
$C(x_i^1 x_j^1)$: value of the network link between documents $x_i$ and $x_j$.
$C_{\max}$: maximum value of $C(x_i^1 x_j^1)$.
k: the lesser of the two quantities: $(1/n)$ and $(1/n)\,2^{-C_{\max}}$.


It will also be convenient to introduce at this point one additional
notation convention. Let us allow the values of the variables in the
$p(x_1 \ldots x_n)$'s which differ from 0 to be specified by a statement
following a colon as well as by superscripting. For example:

\[ p(x_1 \ldots x_n : x_i = 1) = p(x_1^0 \ldots x_i^1 \ldots x_n^0) \]

We are now ready to state the values for the elementary probabilities,
$p(x_1 \ldots x_n)$. Four possible classes will be considered.

(1) All $p(x_1 \ldots x_n)$ for which three or more x's are 1:

\[ p(x_1 \ldots x_n : \text{at least 3 } x\text{'s} = 1) = 0 \]

(2) All $p(x_1 \ldots x_n)$ for which two x's are 1:

\[ p(x_1 \ldots x_n : x_i, x_j = 1) = k^2\, 2^{C(x_i^1 x_j^1)} \quad \text{for all } i,j\ (i \ne j). \]

(3) All $p(x_1 \ldots x_n)$ for which one x is 1:

\[ p(x_1 \ldots x_n : x_i = 1) = k - k^2 \sum_{j \ne i} 2^{C(x_i^1 x_j^1)} \quad \text{for all } i. \]

(4) The $p(x_1 \ldots x_n)$ for which no x is 1:

\[ p(x_1^0 \ldots x_n^0) = 1 - nk + (k^2/2) \sum_{\substack{i,j=1 \\ i \ne j}}^{n} 2^{C(x_i^1 x_j^1)} \]

The motivation behind the selection of these values will become
clearer as the discussion proceeds. It may be helpful, however, to note
three of the underlying ideas at this point.
(1) Each $p(x_i^1)$ is to have the same value.

\[ p(x_i^1) = k \]

(2) The value of the $p(x_i^1)$'s is to be chosen so that the
$p(x_i^1 x_j^1)$'s can be adjusted to give the desired $C(x_i^1 x_j^1)$'s.

\[ p(x_i^1 x_j^1) = k^2\, 2^{C(x_i^1 x_j^1)} \]

(3) The only elementary events that are allowed to occur are those
with zero, one or two documents in the subset of interest.
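Let us prove that the elementary probabilities as selected above
generate the correct values for the links of the document network.
Preliminary to doing this we will determine the values of the
$p(x_i^1)$'s and $p(x_i^1 x_j^1)$'s.

\[ p(x_i^1) = \sum_{\substack{\text{all } p\text{'s for} \\ \text{which } x_i = 1}} p(x_1 \ldots x_n) \]
\[ = p(x_1 \ldots x_n : x_i = 1) + \sum_{\substack{j=1 \\ j \ne i}}^{n} p(x_1 \ldots x_n : x_i, x_j = 1) \]
\[ = k - k^2 \sum_{\substack{j=1 \\ j \ne i}}^{n} 2^{C(x_i^1 x_j^1)} + k^2 \sum_{\substack{j=1 \\ j \ne i}}^{n} 2^{C(x_i^1 x_j^1)} \]
\[ p(x_i^1) = k \quad \text{for all } i. \]

\[ p(x_i^1 x_j^1) = \sum_{\substack{\text{all } p\text{'s for} \\ \text{which } x_i, x_j = 1}} p(x_1 \ldots x_n) = p(x_1 \ldots x_n : x_i, x_j = 1) \]
\[ p(x_i^1 x_j^1) = k^2\, 2^{C(x_i^1 x_j^1)} \quad \text{for all } i,j\ (i \ne j). \]

The link value generated by these probabilities is therefore

\[ \log \frac{p(x_i^1 x_j^1)}{p(x_i^1)\,p(x_j^1)} = \log \frac{k^2\, 2^{C(x_i^1 x_j^1)}}{(k)(k)} = C(x_i^1 x_j^1) \quad \text{for all } i,j\ (i \ne j). \]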


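In order for the set of values selected for the $p(x_1 \ldots x_n)$'s to
form a valid set of probabilities, their sum must be 1.

\[ S = \sum_{\text{over all } x\text{'s}} p(x_1 \ldots x_n) \]
\[ = (1/2) \sum_{\substack{i,j=1 \\ i \ne j}}^{n} p(x_1 \ldots x_n : x_i, x_j = 1) + \sum_{i=1}^{n} p(x_1 \ldots x_n : x_i = 1) + p(x_1^0 \ldots x_n^0) \]
\[ = (k^2/2) \sum_{i \ne j} 2^{C(x_i^1 x_j^1)} + \Big[ nk - k^2 \sum_{i \ne j} 2^{C(x_i^1 x_j^1)} \Big] + \Big[ 1 - nk + (k^2/2) \sum_{i \ne j} 2^{C(x_i^1 x_j^1)} \Big] \]
\[ S = 1 \]
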


We must also prove that the values selected for the $p(x_1 \ldots x_n)$'s
are in the range 0 to 1. The values for the first class of probabilities,
$p(x_1 \ldots x_n : \text{at least 3 } x\text{'s} = 1)$, are all 0 and thus
automatically in the range. The values assigned to the probabilities of the
second class, $p(x_1 \ldots x_n : x_i, x_j = 1)$, can be shown to be in the
range by the following argument.

\[ k \le (1/n)\,2^{-C_{\max}} \le (1/n)\,2^{-C(x_i^1 x_j^1)} \]
\[ \therefore\ k\,2^{C(x_i^1 x_j^1)} \le (1/n) \quad \text{and} \quad k \le (1/n) \]
\[ \therefore\ k^2\,2^{C(x_i^1 x_j^1)} \le (1/n)^2 \]
\[ 0 \le k^2\,2^{C(x_i^1 x_j^1)} \le 1 \]

Next let us show that the values assigned to the probabilities of
the third class, $p(x_1 \ldots x_n : x_i = 1)$, are in the correct range.

\[ k - k^2 \sum_{\substack{j=1 \\ j \ne i}}^{n} 2^{C(x_i^1 x_j^1)} \le k \le (1/n) < 1 \]
\[ k - k^2 \sum_{\substack{j=1 \\ j \ne i}}^{n} 2^{C(x_i^1 x_j^1)} \ge k - k(n-1)(1/n) > 0 \]

Finally let us check the range of $p(x_1^0 \ldots x_n^0)$.

\[ 1 - nk + (k^2/2) \sum_{\substack{i,j=1 \\ i \ne j}}^{n} 2^{C(x_i^1 x_j^1)} \le 1 - nk + (k/2)(n)(n-1)(1/n) = 1 - nk/2 - k/2 < 1 \]
\[ 1 - nk + (k^2/2) \sum_{\substack{i,j=1 \\ i \ne j}}^{n} 2^{C(x_i^1 x_j^1)} \ge 1 - nk \ge 1 - n(1/n) = 0 \]

QED

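The construction used in the proof can be checked numerically. The following is a minimal sketch, not part of the original system: it assumes base-2 logarithms, documents numbered from 0, and a hypothetical dictionary `links` keyed by document pairs. The function names are illustrative only.

```python
import math
from itertools import combinations


def build_probabilities(links, n):
    """Elementary probabilities for a network on documents 0..n-1.

    links maps frozenset({i, j}) to the link value C (log base 2 assumed).
    A pair that is absent from links is treated as C = -infinity.
    """
    def two_pow(i, j):
        c = links.get(frozenset((i, j)), -math.inf)
        return 0.0 if c == -math.inf else 2.0 ** c

    c_max = max(links.values())
    k = min(1.0 / n, (1.0 / n) * 2.0 ** (-c_max))
    pairs = list(combinations(range(n), 2))

    # Class 2: exactly two documents in the subset of interest.
    pair = {frozenset(p): k * k * two_pow(*p) for p in pairs}
    # Class 3: exactly one document in the subset of interest.
    single = {i: k - k * k * sum(two_pow(i, j) for j in range(n) if j != i)
              for i in range(n)}
    # Class 4: no document in the subset of interest.
    # (k^2/2) times the sum over ordered pairs equals k^2 times the sum
    # over unordered pairs.
    none = 1.0 - n * k + k * k * sum(two_pow(*p) for p in pairs)
    return pair, single, none


def recovered_links(pair, single, n):
    """Recompute each link from the marginal and joint probabilities."""
    marginal = {i: single[i] + sum(v for key, v in pair.items() if i in key)
                for i in range(n)}
    out = {}
    for key, p_ij in pair.items():
        i, j = tuple(key)
        if p_ij > 0:
            out[key] = math.log2(p_ij / (marginal[i] * marginal[j]))
        else:
            out[key] = -math.inf
    return out
```

For any network with finite positive links, the three returned groups of values should sum to 1, lie between 0 and 1, and reproduce the chosen link values.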
CHAPTER IV


DOCUMENT CLUSTERS 


In the last chapter a measure of relatedness between documents was 
defined and a document network based on the measure was described. The 
next step to be taken is to formulate a definition for what constitutes 
a subset (cluster) of highly inter-related documents based on this 
measure. The purpose of such a definition is to provide the user who 
has requested information from the system with a set (cluster) of papers 
which is judged to be related to his interest. 

The exact form that a request for information can take and the pro- 
cedure used to translate a request into an answer cluster will be de- 
scribed in Chapter V. The way a cluster is obtained, modified, and 
stored in the experimental system devised for this project will be 
covered in Chapter VI. In this chapter we shall confine our attention 
to what constitutes an appropriate cluster of documents. Two types of
clusters will be defined and analyzed, and certain modifications will be
described which make one of the definitions acceptable.


4.1 Local Maximum Clusters 

The cluster definition which was first proposed and tested turned 
out to be the one which was eventually selected for this project. Let 
us formally define it and then discuss its characteristics.

In this definition and in the remainder of this thesis we will find
use for the following set operators.

$\cup$: Set union--$(A \cup B)$ is the set of all documents in set A or in
set B.

$\cap$: Set intersection--$(A \cap B)$ is the set of documents in both set A
and set B.

$\subset$: Set inclusion--$(A \subset B)$ means that the set A is included in the
set B.

$\bar{X}$: Set complementation--$\bar{X}$ is the set of all documents not in X.

Definition: Local Maximum Cluster

A local maximum cluster is defined to be any subset of documents
$X_\alpha = (x_1, x_2, \ldots, x_n)$ for which both of the following
conditions hold.

1. Every document $x_i$ in $X_\alpha$ is positively correlated to the
remainder of $X_\alpha$.

\[ C[x_i (X_\alpha \cap \bar{x}_i)] > 0 \quad \text{for all } x_i \subset X_\alpha. \]

2. Every document $x_j$ not in $X_\alpha$ is negatively correlated to $X_\alpha$.

\[ C(x_j X_\alpha) \le 0 \quad \text{for all } x_j \subset \bar{X}_\alpha. \]

(Note that zero is arbitrarily classed as a negative value.)

A local maximum cluster is so named because every possible single
change (addition or deletion) to the cluster will result in a decrease
in its internal correlation. The internal correlation C(X) of a subset
X is defined to be the sum of the links whose ends both terminate in the
subset. If $X_\alpha$ is a cluster, then

\[ C(X_\alpha) > C(X_\beta) \]

for all $X_\beta$ which differ from $X_\alpha$ by a single document.
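The two conditions of the definition translate directly into a membership test. The sketch below is illustrative only: the `links` dictionary, the `default` value for links not listed, and the function names are assumptions, and a single-document subset is treated as satisfying Condition 1 vacuously (in line with the later remark that an isolated document forms a cluster by itself).

```python
from itertools import combinations


def link(links, a, b, default):
    """Value of the link between documents a and b."""
    return links.get(frozenset((a, b)), default)


def correlation(links, doc, subset, default):
    """Sum of the links between doc and every other document in subset."""
    return sum(link(links, doc, other, default)
               for other in subset if other != doc)


def internal_correlation(links, subset, default):
    """Sum of the links whose ends both terminate in subset."""
    return sum(link(links, a, b, default)
               for a, b in combinations(sorted(subset), 2))


def is_local_maximum_cluster(links, docs, subset, default):
    subset = set(subset)
    if not subset:
        return False
    # Condition 1: every member is positively correlated to the remainder.
    inside_ok = len(subset) == 1 or all(
        correlation(links, d, subset, default) > 0 for d in subset)
    # Condition 2: every outside document is negatively correlated
    # (zero counts as negative).
    outside_ok = all(correlation(links, d, subset, default) <= 0
                     for d in docs if d not in subset)
    return inside_ok and outside_ok
```

As a usage example, take a hypothetical five-document network made of two triangles sharing document 3, with listed links +5 and all other links -6. The sets {1, 2, 3} and {3, 4, 5} pass the test, while {1, 2} fails because document 3 is positively correlated to it.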

Five specific characteristics of local maximum clusters have been
selected for discussion below.

Size. The average size of the clusters produced by the local
maximum definition is very much a function of the correlation assigned
to document pairs that have not co-occurred together ($N_{ij} = 0$). It has
already been noted that although this correlation, K, is $-\infty$ by the
formula, some finite value is more appropriate (Sec. 3.14). If K is
made positive, then there will be only one cluster consisting of the
total file. If K is made just slightly negative, then the clusters
formed will be disjoint and consist of all documents connected by one or
more paths of positive links. If K is made very negative, the only
clusters will be those sets of documents wherein every document has
co-occurred with every other document.

Overlap. It is fairly obvious that local maximum clusters can
overlap. Consider the network of Fig. 4.1 in which all the links shown have
the value +5 and all the links not shown have the value -6. The two
local maximum clusters, $(x_1 x_2 x_3)$ and $(x_3 x_4 x_5)$, overlap
through $x_3$.

Fig. 4.1. Network with overlapping clusters. (Links shown are +5;
links not shown are -6.)



Coverage. The following simple theorem shows that local maximum
clusters may not cover all the documents in the network.
Theorem. Document networks exist which have documents that are
not included in any local maximum cluster.
Proof. First consider a document that has never co-occurred with
any other document. Such a document does not prove the theorem because
it is included in a cluster which consists of only the document itself.

Now consider the network of Fig. 4.2. The only cluster is
$(x_2 x_3 x_4 x_5)$. The document $x_1$ cannot form a cluster by itself
since $x_2$ and $x_3$ are positively correlated to it. It cannot form a
cluster with $x_2$ and $x_3$ since $x_4$ and $x_5$ are positively correlated
to the set $(x_1 x_2 x_3)$ with the value 5+5-6=4. Thus $x_1$ occurs in no
cluster. QED

Fig. 4.2. Network with a document ($x_1$) in no cluster. (Links shown
are +5; links not shown are -6.)

Although local maximum clusters do not cover all possible documents 

in a network, one is at least assured of the following-- 
Theorem. Every document network contains at least one 

local maximum cluster. 

Proof. The proof will be constructive. A local maximum cluster 
can be formed by successively making single changes (additions or dele- 
tions) to a subset of documents as outlined in the following 3-step 
procedure. 

1. Pick a document at random as the initial member of the subset. 

2. If every document outside the subset is negatively correlated
to the subset and every document inside the subset is positively
correlated to the subset, then quit. The local maximum cluster has been
found.

3. Otherwise either add a positively correlated document that is 
not in the subset or delete a negatively correlated document that is in 
the subset. It doesn't matter which is done, but only one change must 
be made. Now return to step 2.


This procedure is assured of termination if the document set is
finite because step 3 always increases the internal correlation (sum of
the internal links) of the subset being formed. There is, of course, an
upper limit to the internal correlation of any finite set of documents.
QED
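The 3-step procedure of the proof can be sketched as follows. This is an illustration under stated assumptions, not the experimental system's code: `links`, `default`, and the function names are hypothetical, the starting document is passed in rather than picked at random so that the run is repeatable, and a lone document is never deleted (it is treated as trivially satisfying the inside condition).

```python
def correlation(links, doc, subset, default):
    """Sum of the links between doc and every other document in subset."""
    return sum(links.get(frozenset((doc, other)), default)
               for other in subset if other != doc)


def find_local_maximum_cluster(links, docs, start, default):
    subset = {start}                                   # step 1
    while True:
        addable = [d for d in docs
                   if d not in subset
                   and correlation(links, d, subset, default) > 0]
        removable = [d for d in subset
                     if len(subset) > 1
                     and correlation(links, d, subset, default) <= 0]
        if not addable and not removable:              # step 2
            return subset
        # Step 3: make exactly one change, then test again.
        if addable:
            subset.add(addable[0])
        else:
            subset.remove(removable[0])
```

On a hypothetical four-document ring (links 1-2, 2-3, 3-4, 1-4 equal to +5, the two diagonals equal to -6), starting from document 1 the sketch adds document 2 and then stops, since documents 3 and 4 are each correlated -1 to the pair.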


Structure. Local maximum clusters can form the type of hierarchal
structure indicated by the following theorem.

Theorem. A local maximum cluster can be a subset of
another local maximum cluster.

Proof. Again we can use an example to prove the theorem. In the
document network of Fig. 4.3 there are five local maxima:

$(x_1 x_2)$, $(x_2 x_3)$, $(x_3 x_4)$, $(x_1 x_4)$, $(x_1 x_2 x_3 x_4)$.

The first four of these are subsets of the fifth. QED

Fig. 4.3. Network with hierarchal cluster structure. (Links shown
are +5; links not shown are -6.)



Relatedness. Now consider the problem of whether local maximum
clusters form well related sets.

Theorem. Totally unrelated subsets of documents can occur
together in a local maximum cluster. By totally unrelated we
mean that no document in one set is positively correlated to a
document in the other set.

Proof. This theorem can be proved by another simple example. The
set $(x_1 x_2 x_3 x_4)$ of Fig. 4.4 forms a cluster and yet there are no
positive links between the set $(x_1 x_2)$ and the set $(x_3 x_4)$. QED

Fig. 4.4. Cluster containing unrelated subsets. (Links shown are +7;
links not shown are -3.)



The inclusion of unrelated subsets in the same cluster is considered 
an undesirable characteristic for a cluster to have. The reason why this 
is so involves the design of the procedure of Chapter V. It was decided 
that the procedure could be greatly simplified if one were to assume 
that each request for information from the system has only one purpose. 

A person who has several areas of interest on which he desires informa- 
tion is expected to make a separate request for each area. It follows 
that if each request has a single purpose, then the document clusters 
which are to answer these requests should not be divisible into unrelated 


subsets. 


4.2 Subset Clusters 

In an attempt to keep completely unrelated sets of documents from 
becoming part of the same cluster, a definition was devised based on the 
addition of subsets or the deletion of subsets of documents as opposed 
to the single changes allowed in the local maximum definition. This 
definition was accepted as the one most suitable for this project for a 
number of months. In this section we shall describe it, note its
characteristics, and explain why it was finally discarded.

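Definition 1: Subset Cluster

A subset cluster is defined to be any set of documents
$X_\alpha = (x_1, x_2, \ldots, x_n)$ for which both of the following
conditions hold.

1. Every subset of documents $X_\beta$ included within $X_\alpha$ is
positively correlated to the remainder of $X_\alpha$.

\[ C[X_\beta (X_\alpha \cap \bar{X}_\beta)] > 0 \quad \text{for all } X_\beta \subset X_\alpha. \]

2. Every subset of documents $X_\beta$ external to $X_\alpha$ is
negatively correlated to $X_\alpha$.

\[ C(X_\beta X_\alpha) \le 0 \quad \text{for all } X_\beta \subset \bar{X}_\alpha. \]
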


It is worth noting that Condition 2 of the local maximum cluster
definition is equivalent to Condition 2 above. If each document external
to $X_\alpha$ is negatively correlated to $X_\alpha$, then certainly all
external subsets are negatively correlated to $X_\alpha$. Conversely, if
each subset is negatively correlated to $X_\alpha$, then, of course, single
documents, being subsets, are also negatively correlated to $X_\alpha$. It
should also be pointed out that all subset clusters are local maximum
clusters but not vice versa.

Next let us present an alternative definition of a subset cluster. 

Definition 2: Subset Cluster

A subset cluster is defined to be any set of documents
$X_\alpha = (x_1, x_2, \ldots, x_n)$ for which both of the following
conditions hold.

1. The internal correlation of $X_\alpha$ as defined in Sec. 4.1
is greater than the sum of the internal correlations of the disjoint
subsets of $X_\alpha$ created by any arbitrary partitioning.

\[ C(X_\alpha) > \sum_{i=1}^{r} C(D_i) \]

for all partitionings in which $(D_1 \cup \cdots \cup D_r) = X_\alpha$ and
$D_i \cap D_j$ = null set.

2. The sum of the internal correlations of $X_\alpha$ and some subset
$X_\beta$ external to $X_\alpha$ is greater than or equal to the internal
correlation of the set formed by adding $X_\beta$ to $X_\alpha$.

\[ C(X_\alpha) + C(X_\beta) \ge C(X_\alpha \cup X_\beta) \quad \text{for all } X_\beta \subset \bar{X}_\alpha. \]

Theorem. Definition 1 and Definition 2 for subset clusters 

are equivalent. 

Proof. The equivalence of the second conditions of both definitions 
is fairly obvious. The equivalence of the first conditions requires some 
verification. 

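Let us assume that Cond. 1 of Def. 2 holds and partition the
cluster into two subsets.

\[ C(X_\alpha) > C(X_\beta) + C(X_\alpha \cap \bar{X}_\beta) \]

But:

\[ C(X_\alpha) = C(X_\beta) + C(X_\alpha \cap \bar{X}_\beta) + C[X_\beta (X_\alpha \cap \bar{X}_\beta)] \]
\[ \therefore\ C[X_\beta (X_\alpha \cap \bar{X}_\beta)] > 0 \]

This last result is Cond. 1 of Def. 1.

Now let us assume that Cond. 1 of Def. 1 holds and partition the
cluster into the disjoint subsets $D_1, \ldots, D_r$. By Def. 1:

\[ C[D_i (X_\alpha \cap \bar{D}_i)] > 0 \quad \text{for all } D_1, \ldots, D_r \]

But:

\[ C(X_\alpha) = \sum_{i=1}^{r} C(D_i) + (1/2) \sum_{i=1}^{r} C[D_i (X_\alpha \cap \bar{D}_i)] \]
\[ \therefore\ C(X_\alpha) > \sum_{i=1}^{r} C(D_i) \]

Thus if Cond. 1 of Def. 1 is true, Cond. 1 of Def. 2 is also. QED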

Let us discuss now some of the characteristics of subset clusters. 
The comments and theorems on cluster size, overlap and coverage, which 
were made in Sec. 4.1 for local maximum clusters, hold for subset 
clusters also with the exception that one is no longer assured of having
at least one cluster in any given document network.

Theorem. There exist document networks which contain
no subset clusters.
Proof. Examination of each of the 64 possible subsets in the
network of Fig. 4.5 reveals that none of them satisfy the two conditions
necessary for subset clusters. QED

Fig. 4.5. Network containing no subset clusters. (Links not shown
are -5.)
Structure. Next we note that a hierarchal structure is no longer
possible with subset clusters.
Theorem. No subset cluster $X_\beta$ can be included within another
subset cluster $X_\alpha$.
Proof. Let us assume that $X_\alpha$ and $X_\beta$ are subset clusters and
that $X_\beta \subset X_\alpha$. Since $X_\alpha$ is a cluster and
$X_\beta \subset X_\alpha$, then by Cond. 1 of the definition:

\[ C[X_\beta (X_\alpha \cap \bar{X}_\beta)] > 0 \]

But since $X_\beta$ is a cluster and $(X_\alpha \cap \bar{X}_\beta) \subset \bar{X}_\beta$,
then by Cond. 2:

\[ C[X_\beta (X_\alpha \cap \bar{X}_\beta)] \le 0 \]

which contradicts the previous inequality. QED


Relatedness. In the last section it was pointed out that one of the
difficulties with local maximum clusters lies in the fact that even
completely uncorrelated sets of documents can occur in the same cluster.
It was for this reason that the subset definition was devised. In
subset clusters one is assured by definition that no subset of the cluster
is negatively correlated to the remainder of the cluster.


Utility. The problem of coverage and hierarchy did not prove to be 
serious drawbacks to the subset definition of clusters. An extension to 
the definition was devised which allowed all documents to be in at least 
one cluster and provided for hierarchal relationships. This extension 
involved applying a bias to the links of the network. (See Sec. 4.4.)
The reason the subset definition was finally abandoned was because no 
method could be found that would isolate subset clusters with a reason- 
able amount of effort. 

Consider for a moment the problem of checking Condition 1 of the 
subset definition. One must determine whether there is a partitioning 
of a set of documents which results in two subsets that are negatively 
correlated to each other. The brute force method is to try every
partitioning. This would involve $2^n$ tests for a set of n documents and
would certainly be too much processing for an n of 20 or 30 even on a high
speed digital computer. Several efforts were made to devise a more
efficient method. Although they were not entirely successful, it might
be well to briefly document a couple of them.
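The brute force check of Condition 1 can be sketched as below. It is only an illustration: `links`, `default`, and `find_split` are hypothetical names, and each two-way partitioning is visited once by always keeping the first document on one side, so the loop runs over $2^{n-1}-1$ partitionings, which still grows exponentially with n.

```python
from itertools import combinations


def find_split(links, subset, default):
    """Return a partitioning (A, B) whose cross correlation is <= 0, or None.

    links maps frozenset({x, y}) to a link value; unlisted pairs get default.
    """
    docs = sorted(subset)
    first, rest = docs[0], docs[1:]
    # Choose which of the remaining documents join `first` in A;
    # size stops short of len(rest) so that B is never empty.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            a = {first, *extra}
            b = set(docs) - a
            cross = sum(links.get(frozenset((x, y)), default)
                        for x in a for y in b)
            if cross <= 0:   # zero is classed as negative
                return a, b
    return None
```

Applied to a hypothetical four-document network in which only the links 1-2 and 3-4 are +7 and all others are -3, the sketch reports the partitioning ({1, 2}, {3, 4}), so that set is not a subset cluster. A triangle of three +5 links has no such partitioning.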


4.3 Finding Subset Clusters 

In the first method for finding subset clusters which was investi- 
gated, an effort was made to determine if a partitioning of a set existed 
which would result in two negatively correlated subsets. Such a
partitioning is called a 'split' of the set in the following discussion.

In the other approach emphasis was focused on the small, very 
highly correlated subsets called 'kernels' within the document set and 


an attempt was made to combine and expand these until a split appeared. 
4.31 Locating Splits

We wish to devise a method which will determine whether a set of
documents can be split into two negatively correlated subsets and to
locate where such splits are. Some of the theorems that were developed
for this purpose will be stated below. In the interests of brevity the
proofs will not be given. The symbols used in these theorems are
defined as follows.

n - number of documents in S, the set under consideration.

a - number of documents in a subset A of S.

b - number of documents in a subset B where $B = S \cap \bar{A}$.
($a+b=n$, $A \cup B = S$)

K - negative value assigned to links for which $N_{ij} = 0$.

$C_{\min}$ - smallest value of the links for which $N_{ij} \ne 0$. It will be
assumed in the following theorems that $C_{\min}$ is positive.
(See Sec. 3.5.)

$C_{\max}$ - largest positive link in the network.

d - number of links in the set S which have the value K.
Theorem 1: Consider the partitioning of a set of
documents into the subsets A and B.

Part A: Only those partitionings which satisfy the following
inequality can possibly result in splits.

\[ (a)(b) \le d \left( \frac{C_{\min} + |K|}{C_{\min}} \right) \]

Part B: A necessary condition for a partitioning to result in a
split is that the partitioning must be crossed by at least
r negative links where:

\[ r = \frac{(a)(b)(C_{\min})}{C_{\min} + |K|} \]

Part C: A sufficient condition for a partitioning to result in
a split is that the partitioning be crossed by at least s
negative links where:

\[ s = \frac{(a)(b)(C_{\max})}{C_{\max} + |K|} \]

Example of Theorem 1:

n = 20
K = -5
$C_{\min}$ = 4
d = 40 (40 of the 190 links are negative)

By Part A of the theorem (a)(b) must not exceed 90 to allow a
split. Therefore partitionings with distributions a:b = 10:10, 9:11,
8:12, and 7:13 cannot possibly result in splits. This immediately
eliminates about 90% of the possible partitionings as candidates for
splitting the set. Unfortunately there are some 60,460 partitionings
that still must be considered, which is still out of the question.

However if the 40 negative links are all bunched on only 5 of the
nodes (8 per node), then by Part B of the theorem only 61 partitionings
can possibly cause splits and these can easily be checked.

If only 10% of the links are negative (19 instead of 40), then only
partitionings with a:b = 1:19 and 2:18 can cause splits. There are 210
such partitionings and a check of these would also be possible.

However in the general case $C_{\min}$ may be small, d may be large, and
the negative links may not be so fortuitously arranged, so that the number
of partitionings which must be examined may still remain very large.
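The arithmetic of this example can be reproduced with a few lines of code. The sketch below reads Part A as $(a)(b) \le d\,(C_{\min}+|K|)/C_{\min}$, a reading that agrees with the size distributions ruled out in the example; the function name and argument names are illustrative only.

```python
from math import comb


def candidate_partitionings(n, k_value, c_min, d):
    """Size distributions a:b that Part A does not rule out, and how many
    distinct partitionings they represent."""
    bound = d * (c_min + abs(k_value)) / c_min
    sizes, count = [], 0
    for a in range(1, n // 2 + 1):
        b = n - a
        if a * b <= bound:
            sizes.append((a, b))
            # When a == b each partitioning would otherwise be counted twice.
            count += comb(n, a) // 2 if a == b else comb(n, a)
    return bound, sizes, count


bound, sizes, count = candidate_partitionings(20, -5, 4, 40)
total = 2 ** (20 - 1) - 1   # all two-way partitionings of 20 documents
print(bound, sizes, count, round(1 - count / total, 3))
```

With d = 40 the bound is 90 and the surviving distributions run from 1:19 to 6:14; with d = 19 the bound drops to 42.75 and only 1:19 and 2:18 survive.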

Theorem 2 is concerned with the possibility of finding splits of 


the set S as it is being formed. 


Theorem 2. Consider the possibility of a set of documents 
being split by the addition of another document. Three statements 
can be made. 

1. If the new document is positively correlated to each item 
in the set, then no split can be created. 

2. If a split is created, it must be crossed by at least 
one newly added negative link. 

3. The sum of the newly added links crossing any split 
created must be negative. 

The next two theorems will help to determine whether the set S is a 
subset cluster when it contains one or more documents that are positively 
correlated to all of the other documents in S. 

Theorem 3. If a set of n documents has d or more documents
that are positively linked to every other document in the set,
then the set has no splits, where:

\[ d = \frac{n\,|K|}{C_{\min} + |K|} \]

Theorem 4. Assume that a set of documents has splits. Now
remove all those documents that are positively correlated to
every other document in the set. The reduced set must also
have splits.

The sum of the links connecting documents in the subset A to docu- 
ments in B is termed the cross correlation of the partitioning which 
created A and B. The following three theorems relate to this cross 
correlation. 

Theorem 5. The cross correlations of all possible partitionings
of a document set are equal if and only if every link
has the value 0. (n>3)

Theorem 6. The cross correlations of all possible partitionings
of a document set of size a:b are equal if and only
if every link has the same value.

Theorem 7. The average cross correlation of the partitionings
of size a:b is $C(S)(a)(b)/\binom{n}{2}$ where C(S) is the total
internal correlation of the set.
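Theorem 7 is stated without proof, but it is easy to check by enumeration on a small example. The following sketch assumes a hypothetical `links` dictionary that lists every pair of documents; the function name is illustrative.

```python
from itertools import combinations
from math import comb, isclose


def average_cross_correlation(links, docs, a):
    """Average, over all subsets A of size a, of the sum of links that
    cross between A and the rest of docs."""
    totals = []
    for subset_a in combinations(docs, a):
        a_set = set(subset_a)
        # A link crosses the partitioning when exactly one end lies in A.
        cross = sum(v for pair, v in links.items() if len(pair & a_set) == 1)
        totals.append(cross)
    return sum(totals) / len(totals)


docs = list(range(5))
links = {frozenset((i, j)): (3 * i + 5 * j) % 7 - 3
         for i, j in combinations(docs, 2)}
c_s = sum(links.values())
a, b = 2, 3
print(average_cross_correlation(links, docs, a), c_s * a * b / comb(5, 2))
```

The two printed values agree because any given link is crossed by a fraction $ab/\binom{n}{2}$ of the partitionings of size a:b.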


4.32 Forming Kernels 

Another method which was considered as a way for determining if a 
set was a subset cluster was to form highly correlated kernels within 
the set in question and thereby try to locate possible splits. The ker- 
nels might initially be those subsets wherein every document is posi- 
tively correlated to every other document. These sets could then be 
combined in various ways to see if any splits appeared. The following 
two theorems relate to this approach. 

The symbols used are as defined in the last section and as follows:

$C_{ave}$ - average of the positive links of the set.

$D_i$ - the $i$th disjoint kernel of the set S.
$D_1 \cup \cdots \cup D_r \subseteq S$
$D_i \cap D_j$ = null set for all i,j ($i \ne j$).

Theorem. If the sum of the internal correlations of a set
of disjoint kernels is greater than or equal to the total
internal correlation of the set, then there is at least one
split in the set.

In other words, if:

\[ \sum_{i=1}^{r} C(D_i) \ge C(S) \]

then S has at least 1 split.

Theorem. A sufficient condition for having at least one
split in a set is that the set contain at least d negative
links where:


4.4 Biased Clusters

In this section an extension or modification to the cluster defini- 
tions is proposed. It was initially devised in order that subset 
clusters could have a hierarchal structure. It was found to be a useful 
modification to local maximum clusters also. 

As a way of introducing the concept of a biased cluster, let us con- 
sider a large cluster (either local maximum or subset) of documents 
covering a rather broad field of interest. There will, of course, be 
users who want all of the documents in such a cluster, but what about 
the users whose interests are very specific and who want only a small 
portion of the cluster? As yet there has been no provision for such a 
narrowing of interest. Subset clusters and many local maximum clusters 
are not decomposable. We shall now present the theoretical basis of a 
method which will allow a cluster to be reduced to a more specific set 
or enlarged to a more general set. 

Consider a set of documents, $W = (w_1 \ldots w_m)$, which forms a cluster
in the overall document network. The problem of retrieving a portion of
this cluster is regarded as equivalent to the problem of finding a
cluster in the sub-library consisting only of W.

In order to show how this might be done let us define a new sample
space which has only $2^m$ points instead of the $2^n$ points of the original
sample space. Each point in the new space represents a possible
partitioning of W. To distinguish between the probabilities of the two
sample spaces, the probabilities of the old sample space will be given
a subscript '$\alpha$' and the probabilities of the new sample space a
subscript '$\beta$'. Let the probabilities assigned to the points of this new
sample space be initially equal to the marginal probabilities of the
corresponding events over the old sample space.

\[ p_\beta(w_1 \ldots w_m) = p_\alpha(w_1 \ldots w_m) = \sum_{\substack{\text{over all } x \\ \text{not in } W}} p(x_1 \ldots x_n) \]

The marginal probability, $p_\alpha(w_1^0 \ldots w_m^0)$, is the sum of the
probabilities of all those elementary events in which none of the documents
in W are in the subset of interest. Since these events are irrelevant when
one is considering only the sub-library W, let us set
$p_\beta(w_1^0 \ldots w_m^0)$ equal to 0. Such a step requires that the other
$p_\beta(w_1 \ldots w_m)$'s all be increased by a normalizing factor k. The
final values for the probabilities assigned to the new sample space can now
be specified.

\[ p_\beta(w_1^0 \ldots w_m^0) = 0 \]
\[ p_\beta(w_1 \ldots w_m) = k\,p_\alpha(w_1 \ldots w_m) \quad \text{for all } p_\beta(w_1 \ldots w_m) \text{ except } p_\beta(w_1^0 \ldots w_m^0) \]
\[ k = 1/[1 - p_\alpha(w_1^0 \ldots w_m^0)] \]


Now let us consider the effect of this change in the sample space
on the correlation of any two documents in W.

\[ C_\beta(w_i^1 w_j^1) = \log \frac{p_\beta(w_i^1 w_j^1)}{p_\beta(w_i^1)\,p_\beta(w_j^1)} \]
\[ = \log \frac{k\,p_\alpha(w_i^1 w_j^1)}{[k\,p_\alpha(w_i^1)][k\,p_\alpha(w_j^1)]} \]
\[ C_\beta(w_i^1 w_j^1) = C_\alpha(w_i^1 w_j^1) - \log(k) \]

Thus the correlations for the sub-library can be obtained by merely
subtracting a constant or bias from the correlations for the full library.
An alternative way to describe this approach is through the frequency
counts used in making the probability estimates. Instead of considering
all the available partitionings of the document file, let us consider
only those partitionings in which one or more of the documents in W occur
in the subset of interest. Let us denote the counts based on this
restricted set of partitionings by the letter M and use N for the original
counts.

\[ N_i = M_i \quad \text{for all } i \text{ in } W. \]
\[ N_{ij} = M_{ij} \quad \text{for all } i,j \text{ in } W. \]

Now let us consider what happens to the approximation to C based on
the probability estimates with the new frequency counts.

\[ \tilde{C}_\beta(w_i^1 w_j^1) = \log \frac{M_{ij}/M}{(M_i/M)(M_j/M)} = \log \frac{M\,M_{ij}}{M_i\,M_j} \]
\[ = \log \frac{M\,N_{ij}}{N_i\,N_j} = \log \left[ \frac{N\,N_{ij}}{N_i\,N_j} \cdot \frac{M}{N} \right] \]
\[ \tilde{C}_\beta(w_i^1 w_j^1) = \tilde{C}_\alpha(w_i^1 w_j^1) - \log(N/M) \]

Here again we note that we can in effect reduce the size of the
library under consideration by merely subtracting a constant from each
correlation value.
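The effect of restricting the partitioning file can be illustrated numerically. The sketch below is hypothetical: it represents each partitioning by the set of documents of interest, uses base-2 logarithms, and introduces the names `estimated_correlation` and `restrict_to` purely for illustration.

```python
import math


def estimated_correlation(partitionings, i, j):
    """Estimate C for documents i and j from frequency counts."""
    total = len(partitionings)
    n_i = sum(1 for p in partitionings if i in p)
    n_j = sum(1 for p in partitionings if j in p)
    n_ij = sum(1 for p in partitionings if i in p and j in p)
    return math.log2(total * n_ij / (n_i * n_j))


def restrict_to(partitionings, w):
    """Keep only partitionings in which at least one document of W occurs."""
    return [p for p in partitionings if p & w]


full = [{1, 2}, {1, 2, 3}, {2, 3}, {4}, {4, 5}, {5}, {1}, {3, 4}]
w = {1, 2, 3}
restricted = restrict_to(full, w)

c_alpha = estimated_correlation(full, 1, 2)
c_beta = estimated_correlation(restricted, 1, 2)
bias = math.log2(len(full) / len(restricted))   # log(N/M)
print(c_alpha, c_beta, bias)
```

In this small file N = 8 and M = 5, and the restricted estimate equals the full estimate minus log(N/M), as derived above.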

In an analogous manner we can increase the size of the library and
thereby obtain larger, more general clusters by adding some bias to each
correlation in the network.

We now observe that of the three measures which meet the criteria
outlined in Sec. 3.2 (3, 4, and 8) only Measure 8 allows this type of
narrowing and broadening of the request range. Measures 3 and 4 are
insensitive to any change in the size of the library or partitioning file.

One final question arises concerning the biasing of the value K
assigned to links for which $N_{ij} = 0$. One could either let the bias affect
all links equally or one could look upon K as a fixed value which is not
changed by the bias. The latter approach was rather arbitrarily
selected.

We are now ready to define what is meant by a biased cluster. 

Definition: Biased Cluster 

A biased local maximum cluster has the same definition as 

a regular local maximum cluster, but a non-zero bias has been 

applied to the document network in which the cluster is formed. 

The same is true of a biased subset cluster. 

In summary, a simple, easy-to-use method has been suggested which 
will allow the size of clusters to be increased or decreased. Some 
arguments have been presented which show that the method has a sound 


theoretical basis. 


4.5 Final Cluster Decision 
The local maximum definition of clusters was reconsidered after no
general method for finding subset clusters was found. It was pointed
out in Sec. 4.1 that local maximum clusters were considered unacceptable
because totally unrelated subsets of documents could be part of the
same cluster. The following theorem and lemmas show that this
difficulty can be avoided by selecting an appropriate value for K.

During the remainder of this section it will be assumed that all of
the links for which $N_{ij} \ne 0$ are positive (See Sec. 3.5). If this
condition does not hold then the theorems and lemmas which follow can be
restated in terms of links for which $N_{ij} \ne 0$ and links for which
$N_{ij} = 0$ instead of positive and negative links.

Theorem. Each document in a local maximum cluster of n
documents is positively linked to over half of the remaining
n-1 documents if $K \le -C_{\max}$.

Proof. By definition each document in a local maximum cluster is 


positively correlated to the remaining (n-1) documents in the cluster. 


Now if the positive links are smaller than or equal in magnitude to the
negative links, then it stands to reason that there must be more of the
former to yield a positive sum.


Lemma. Consider a local maximum cluster that is partitioned
into 2 subsets, $X_\alpha$ and $X_\beta$, with $X_\beta$ the larger if they
differ in size. If $K \le -C_{\max}$, every document in $X_\alpha$ has at
least one positive link to the other subset.

Lemma. In a local maximum cluster with $K \le -C_{\max}$ there
can be no subset that is totally uncorrelated (has no positive
links) to the remainder of the cluster.

The choice of $K \le -C_{\max}$ does not insure that a local maximum cluster
will be free of splits and thus be a subset cluster. Subsets can still
be negatively correlated to the remainder of the cluster. But it does
insure that the rather strong type of relatedness expressed by the above
two lemmas will exist for each partitioning of a local maximum cluster.

Another advantage to choosing $K \le -C_{\max}$ is that it provides the
system with a very simple test of whether two documents can be in the
same local maximum cluster.

Theorem. If $K \le -C_{\max}$ then two negatively linked documents
can occur in a local maximum cluster together only if they are
positively linked to at least one common document.

Proof. Consider a local maximum cluster of n documents. Assume
that there are two negatively correlated documents, x_A and x_B, in the
cluster. By the previous theorem x_A must be positively correlated to
over half of the (n-1) other documents in the cluster. Since x_A is not
positively correlated to x_B it must be positively correlated to more
than half of the remaining (n-2) documents. This is true of x_B also.
Thus they must be positively correlated to at least one common document.
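
As an illustration, the test given by this theorem might be sketched as follows. The link table, the document names, and the function name are assumptions made for the example and are not taken from the experimental system; the sketch also assumes that K ≤ -C_max, so that the theorem applies, and that pairs absent from the table are negatively linked.

```python
def may_share_cluster(a, b, link):
    """Necessary condition from the theorem (assumes K <= -C_max).

    `link` maps frozenset({u, v}) to the signed link value between two
    documents; pairs absent from the table are treated as negative links.
    Returns False only when a and b cannot occur in the same cluster.
    """
    def positive_neighbours(d):
        return {other
                for pair, value in link.items() if d in pair and value > 0
                for other in pair - {d}}

    if link.get(frozenset((a, b)), -1) > 0:
        return True  # positively linked pairs are not ruled out by this test
    # Negatively linked: at least one common positive neighbour is required.
    return bool(positive_neighbours(a) & positive_neighbours(b))


# Hypothetical network: x1 and x2 are negatively linked, both positive to x3.
link = {
    frozenset(("x1", "x3")): 4,
    frozenset(("x2", "x3")): 4,
    frozenset(("x1", "x2")): -5,
    frozenset(("x4", "x3")): 4,
}
print(may_share_cluster("x1", "x2", link))  # x3 is a common positive neighbour
print(may_share_cluster("x1", "x5", link))  # x5 has no positive links at all
```

The test is only a necessary condition: a pair that passes it may still fail to appear together in any cluster.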

Next let us consider what value should be assigned to K to insure
that K ≤ -C_max. In Sec. 3.5 it was shown that the largest value that the
estimated correlation can possibly take is (log N) where N is the number
of available partitionings of the document file. Thus if we make K equal
to (-log N) we will be assured that K ≤ -C_max.

So far some reasons have been given indicating that it might be 
expedient from a practical standpoint to make K equal to (-log N). Let 
us now consider whether this value for K is justifiable theoretically. 

It was noted in Sec. 3.5 that if the frequency counts are based on 
a finite number (N) of partitionings, then none of the probability 
estimates can fall between 0 and 1/N. This results in those correlations
which might have been in the range -∞ to (2-log N) being estimated to
be -∞ (or perhaps some value greater than (2-log N)). It was suggested
that those correlation estimates that are -∞ by the formula might be
more appropriately adjusted to some finite negative value, K, since a
correlation of -∞ implies that there is absolutely no chance of the two
documents ever occurring together. 

Thus K can be considered an approximation to the correlations in the
range -∞ to (2-log N) and it would seem appropriate that it assume some
value within that range. Consider also what value K should assume as N
approaches ∞. It is suggested that K should approach -∞ as N
approaches ∞ since those document pairs for which N_ij still equals 0 in
the limit do in fact never occur together and C(x_i,x_j) should be -∞.

There are two other consequences to making K=-log N that should be 
noted. It gives the correlation a symmetric range about 0 (-log N to 
log N). It also forces the correlation of documents that have never 
occurred together to always be less than the correlation of documents 
that have co-occurred [(-log N) <(2-log N)]. 
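
The effect of setting K = -log N can be checked numerically. The estimate of Sec. 3.5 is not reproduced in this chapter, so the sketch below assumes a form for it: the base-2 logarithm of the observed co-occurrence count over the count expected under independence. That assumed form does give log N as its largest value. The function name and the counts are hypothetical.

```python
import math


def estimated_correlation(n_i, n_j, n_ij, n_total):
    """Assumed form of the estimate: log2(N * N_ij / (N_i * N_j)).

    Pairs that never co-occur (N_ij = 0) would give -infinity, so they
    are assigned the finite value K = -log2(N) instead.
    """
    if n_ij == 0:
        return -math.log2(n_total)           # K = -log N
    return math.log2(n_total * n_ij / (n_i * n_j))


N = 16                                        # hypothetical number of partitionings
print(estimated_correlation(1, 1, 1, N))      # upper end of the range: log N = 4.0
print(estimated_correlation(8, 8, 1, N))      # a co-occurring pair at 2 - log N = -2.0
print(estimated_correlation(8, 8, 0, N))      # never co-occur: K = -log N = -4.0
```

With these counts the pair that has never co-occurred falls below the pair that co-occurred once, and the range is symmetric about 0.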

The local maximum definition is therefore selected for use in this 
project. Its definition is extended to include biased clusters and it 


is required that K = -log N. Hereafter we will refer to a local maximum 


cluster as just a cluster. 
CHAPTER V


SEARCH PROCEDURE 


The last component of the theoretical model is the procedure which 
transforms a request for information into the set of documents that com- 
prise the answer. The first step in describing the procedure will be to 
make a number of definitions. Then a list of features that a suitable 
procedure should have will be given. Finally the particular procedure 


developed for this project will be described and analyzed. 


5.1 Definitions 

Definition: Request 

A request for information from the system is defined to consist
of two subsets of documents. One subset, Y=(y_1,...,y_m),
contains those papers known by the user to be pertinent to the
current search. The other, Z=(z_1,...,z_n), contains those papers
that are known to be not pertinent. The Y subset must be non-empty
but the Z subset can be empty.
Definition: Answer 

An answer to a request is defined to be a cluster of 
documents which includes the Y subset of the request and 
excludes the Z subset. 
Definition: Clustering Procedure 

Any algorithm which transforms a request into an answer 


will be termed a clustering procedure (sometimes hereafter just
called a procedure). We will consider for this project only
clustering procedures which are iterative in nature and which
on each iteration change the contents of a certain set of documents,
S=(s_1,...,s_k). Upon termination of the procedure S is
to be the answer set. For most of the procedures considered
here only a single change is made to S on each iteration. The
S generated by the i-th iteration can be distinguished by a
subscript (S_i).
Definition: Convergent Procedure 

A convergent procedure is one that terminates after a 
finite number of iterations. 
Definition: Inconsistent Request 

A request is said to be inconsistent if there is no answer 
cluster for any bias which satisfies the request. 
Definition: Ambiguous Request 

A request is said to be ambiguous if there is more than 
one answer cluster which satisfies the request. Note that one 
must consider all possible biases in determining ambiguity. 
Requests with empty Z sets will generally be ambiguous. This is 

because larger and larger answer clusters can be formed by increasing 
the bias. For example, the request of Fig. 5.1 is ambiguous having the 


following four possible answers. 


Answer                  Bias

(y_1)                   -∞ → -4
(y_1 x_1)               -4 → 3
(y_1 x_1 x_2)            3 → 7
(y_1 x_1 x_2 x_3)       +7 → +∞
[Network diagram not reproduced. Links not shown are -5.]

Y=(y_1)
Z=( )


Fig. 5.1. Ambiguous Request.


5.2 Attributes of a Good Clustering Procedure 


In this section we shall list some characteristics which the
clustering procedure should have. It will be assumed that the definition
of a cluster of documents as given in Chapter IV is suitable. If this is
the case, then the basic objective of a clustering procedure would be to
locate the appropriate cluster in an efficient way.

1. Request Satisfaction 

If the request is unambiguous and consistent, then the procedure 
should produce the one cluster which satisfies the request. 
2. Request Modification 

If the request is ambiguous or inconsistent, then the procedure should
be able to recognize this fact and should help the user to modify his
request. This suggests that the procedure should allow close man-machine
coupling so that information generated by the clustering process
can be presented to the user for his examination and modifications to the
request can be fed back into the system.

3. Convergence 
The procedure should be convergent for every possible request and
document network. Whether it is forming an answer cluster or determining
request ambiguity or inconsistency, it should never fall into a repetitive,
non-terminating cycle.
4. Minimal Number of Iterations 

The procedure should find the answer in as few iterations as 
possible. An excessively large number of deletions of previously added 


documents from the set being formed would be undesirable. 


5.3 Description of Procedure 
A description and flow chart of the procedure developed for this 


project will be presented in this section. An analysis of the procedure 
will be given in Sec. 5.5. 

Fig. 5.2 is a block diagram showing the overall structure of the 
procedure. Before attempting to describe each block in Fig. 5.2 in 
detail let us make some general comments about the procedure. 

There are three basic phases which the procedure can enter depending 
on the amount of bias required and the relationships of various documents 
and sets of documents. 

Phase I: No Bias 

The procedure starts in this phase, remains in it as long as no bias
is required, and returns to it from Phase II if at some point the bias
can be reduced to zero. The documents considered for addition to S in
this phase are those (positive to S) which keep each y_i in Y positive to
S (or at least increase its correlation to S) and keep each z_i in Z
negative to S (or at least decrease its correlation to S). Of these
candidates the one with the highest correlation to S is selected for
addition to S. If at some point there are no more documents that are
positive to S, then the procedure terminates. If there are documents


[Flow chart not reproduced; its blocks, in order, are:]

1. Initialization

2. Condition 1 of Cluster Definition:
   Are there documents in S that are negative to S? (Y's excluded)

3. Delete a document from S.

4. Condition 2 of Cluster Definition:
   Are there documents not in S that are positive to S? (Z's excluded)

5. Add a document to S.

6. Is Y included in S and Z excluded from S?

7. Are there request documents in trouble by the above test which
   are in both Y and Z?

8. Change the Bias.

9. Mark request as inconsistent.


Fig. 5.2. Overall Flow Chart.
that are positive to S but none of them meet the above conditions with
respect to Y and Z, then it is concluded that some bias will be needed 
and Phase II is entered. 

Phase II: Bias 

In Phase II the bias is either made positive enough to keep all the
y_i's positive to S or made negative enough to keep all the z_i's negative
to S. On each iteration those documents that are positive to S by the
current bias are considered for addition to S. Of these candidates the
document which requires the least bias when added to S is selected for
addition to S. If at any time the bias becomes zero the procedure
returns to Phase I.

When there are no more documents that are positive to S, the procedure
either terminates or enters Phase III. Actually certain constraints
are placed on the amount the bias can change on any one iteration. This
means that all of the request documents may not be properly correlated to
S (y_i's positive to S and z_i's negative to S) at the end of Phase II.
If they are all properly correlated to S (i.e. the request is satisfied),
the procedure terminates. If they are not yet properly correlated to S,
the procedure enters Phase III.
Phase III: Monotonic Bias 

The purpose of this phase is to either make positive to S certain y_i
that are not currently positive to S or to make negative to S certain z_i
that are not currently negative to S. This is accomplished by allowing the
bias to move in only one direction while suitable additions and/or
deletions are made to S. One may not return to Phase I or II from Phase
III. Phase III and the procedure terminate when the y_i's and z_i's are
correctly linked to S.


The detailed flow charts for the general blocks of Fig. 5.2 will be 
greatly simplified if we first define a number of symbols. 
Flow Chart Symbol Definitions
∅: The null set.
∩: Set intersection operator.
∪: Set union operator.
S̄: Set of all documents not in set S. (Complement)

⊂: Set inclusion: A ⊂ B means set A is included in set B.

Y: The set of all documents specified as interesting by the user.

Z: The set of all documents specified as not interesting by the user.

S: The set which is being formed into the answer cluster by the
procedure. (Y ⊂ S)

P: The set of all documents positively correlated to the set S by the
current bias. A document in S is in P if it is positively
correlated to the remainder of S.

Q: The set of documents included in P but not in S or Z. The document
to be added to S will be chosen from this set. Q = P ∩ S̄ ∩ Z̄

T: The set consisting of those documents in Q which will not require
positive bias if added to S. Document t_j is in T if when it
is added to S it will do one or both of the following operations
for every document y_i in Y.

(1) Keep y_i positive to the new S.         C[y_i,(S ∪ t_j)] > 0
                                            (with 0 bias)
(2) Increase the correlation of y_i to S.   C(y_i,t_j) > 0
                                            (with 0 bias)

V: The set consisting of those documents in Q which will not require a
negative bias if added to S. Document v_j is in V if when it
is added to S it will do one or both of the following operations
for every document z_i in Z.

(1) Keep z_i negative to the new S.         C[z_i,(S ∪ v_j)] < 0
                                            (with 0 bias)
(2) Decrease the correlation of z_i to S.   C(z_i,v_j) < 0
                                            (with 0 bias)

X: The set of documents which are candidates for addition to S. If
there are one or more documents in Q that require no bias if
added to S, then X contains those documents. Otherwise it
contains the documents that require a change in bias in only
one direction.

W: The set of documents which are candidates for deletion from S. A
document w_i is in W if it is negatively correlated to the
remainder of S by the current bias and if it is not included
in Y.
C[w_i,(S ∩ w̄_i)] < 0          w_i ⊂ S ∩ Ȳ
f: Number of positive links in the set S. (with no bias)
g_i: Number of positive links from document x_i to S. (with no bias)

d_i: Bias required for the set (S ∪ x_i). If x_i ⊂ T ∩ V̄ then d_i is just
negative enough to keep each z_j negative to (S ∪ x_i). If
x_i ⊂ V ∩ T̄ then d_i is just positive enough to keep each y_j
positive to (S ∪ x_i). If x_i ⊂ T ∩ V then d_i is made 0.

BIAS: Current bias.

b_i: Allowable change in bias if x_i is added to S.

b_i = minimum [(d_i - BIAS), 1, 10/(f+g_i), C(x_i,S)/(f+g_i)]

(C above is by current bias.)

R: The set of documents in X that would keep the bias at 0 or allow it
to be reduced to 0 if added to S.

|BIAS + b_i| = 0 for all x_i ⊂ R
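
The sets Q, T, and V and the Phase I selection rule can be illustrated with a short sketch at zero bias. It assumes that the correlation of a document to a set is the sum of its links to the members of that set, and it uses a small hypothetical network (not one of the figures) in which links listed in `SHOWN` are +5 and all others are -6. All names in the sketch are assumptions; nonzero bias is not modelled.

```python
POSITIVE, NEGATIVE = 5, -6
DOCS = ["y1", "x1", "x2", "x3", "x4", "z1", "z2"]
# Hypothetical network: only these pairs carry the positive link value.
SHOWN = {frozenset(p) for p in [
    ("y1", "x1"), ("y1", "x2"), ("y1", "x3"), ("y1", "x4"), ("x3", "x4"),
    ("x1", "z1"), ("x2", "z2"), ("y1", "z1"), ("y1", "z2"),
]}


def link(u, v):
    return POSITIVE if frozenset((u, v)) in SHOWN else NEGATIVE


def corr(d, S):
    """C(d, S) at zero bias, assumed to be the sum of the links from d
    to the other members of S."""
    return sum(link(d, s) for s in S if s != d)


def phase_one_step(S, Y, Z):
    """Return the document Phase I would add to S, or None when no
    document lies in both T and V (termination or need for bias)."""
    Q = {d for d in DOCS if d not in S and d not in Z and corr(d, S) > 0}
    T = {t for t in Q
         if all(corr(y, S | {t}) > 0 or link(y, t) > 0 for y in Y)}
    V = {v for v in Q
         if all(corr(z, S | {v}) < 0 or link(z, v) < 0 for z in Z)}
    candidates = T & V
    if not candidates:
        return None
    # Highest correlation to S wins; ties are broken by name.
    return max(sorted(candidates), key=lambda d: corr(d, S))


S, Y, Z = {"y1"}, {"y1"}, {"z1", "z2"}
while (choice := phase_one_step(S, Y, Z)) is not None:
    S.add(choice)
print(sorted(S))
```

In this network x1 and x2 are positive to S but would pull z1 or z2 toward S, so they fall outside V and only x3 and x4 are added.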


We are now ready to present more detailed flow charts for the 


blocks of Fig. 5.2. Fig. 5.3 covers block 1, Fig. 5.4 covers blocks 2 


and 3, Fig. 5.5 covers blocks 4 and 5, and Fig. 5.6 covers blocks 6-9.


A brief comment is made to the right of each step in these detailed flow 


charts as an aid to understanding them. More precise statements of 


their functions are given in Sec. 5.5. 


5.4 Earlier Procedures 


For historical purposes and for comparison and analysis, let us 


briefly document some of the earlier procedures which were considered. 


Procedure 1 
Briefly this procedure transforms a request into three subsets:
A: the set of documents related to the request. 
B: the set of some of the documents not related to the 
request. 
C: a ‘limbo’ set of documents positively correlated to both 


sets A and B. 
Initially set A contains only those documents specified as 


interesting by the user, and set B contains those documents speci- 
fied as non-interesting. On each iteration all documents positively 
(negatively) linked to A(B) and negatively (positively) linked to 
B(A) are added to A(B). Documents positively linked to both A and 
B are placed in limbo while those negatively linked to both are 
ignored. All changes to the sets A, B, and C are made concurrently 


at the end of each iteration. 
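
One iteration of Procedure 1 might be sketched as follows. The sketch assumes that the correlation of a document to a set is the sum of its links to the members of the set, that the limbo set is recomputed on every iteration, and that a correlation of zero counts as not positive. The network and all names are hypothetical.

```python
# Hypothetical network: listed pairs are +5, all other pairs are -6.
SHOWN = {frozenset(p) for p in [
    ("y1", "x1"), ("y1", "x2"), ("z1", "x2"), ("z1", "x3"),
]}
DOCS = {"y1", "x1", "x2", "x3", "z1"}


def corr(d, S):
    return sum(5 if frozenset((d, s)) in SHOWN else -6 for s in S if s != d)


def procedure_one_iteration(A, B):
    """Classify every unassigned document against the current A and B,
    then apply all changes together at the end of the iteration."""
    add_a, add_b, limbo = set(), set(), set()
    for d in DOCS - A - B:
        to_a, to_b = corr(d, A) > 0, corr(d, B) > 0
        if to_a and to_b:
            limbo.add(d)        # positive to both sets: placed in C
        elif to_a:
            add_a.add(d)        # positive to A, not positive to B
        elif to_b:
            add_b.add(d)        # positive to B, not positive to A
        # documents positive to neither set are ignored
    return A | add_a, B | add_b, limbo


A, B, C = procedure_one_iteration({"y1"}, {"z1"})
print(sorted(A), sorted(B), sorted(C))
```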
[Flow chart boxes not reproduced; the comment printed beside each step follows.]

Step 1.     INTERACTION POINT. Allow user to specify initial Y and
            Z sets.

Steps 2-3.  Put the interesting documents in S.
            Indicate that the procedure is not yet in the third phase.

Step 4.     Start with an initial bias of 0.


Fig. 5.3. Initialization


[Flow chart boxes not reproduced; the comment printed beside each step follows.]

Step 5.  Check if there are documents in S that are negative to the
         remainder of S.

Step 6.  Point at which information can flow between the user and the
         system (e.g. status of clustering procedure, data on particular
         documents, modifications to the request, etc.).

Step 7.  Delete a document from S.


Fig. 5.4. Condition 1 and Deletions.


[Flow chart boxes not reproduced; the comment printed beside each step
follows, with the box contents that survive.]

Step 8.   Check if there are any more documents positive to S.

Step 9.   Check if there are documents positive to S that keep (or try
          to keep) all the y's positive and all the z's negative.

Step 10.  Check if there are documents which require a change in bias
          in only one direction. Note that T ∪ V = (T ∩ V̄) ∪ (V ∩ T̄)
          at this point.

Step 11.  Load the set X with the candidates for addition to S.

Step 12.  Check if one or more documents in X can allow the bias to
          drop to zero.

Step 13.  Point at which information can flow between the user and the
          system (e.g. status of clustering procedure, data on
          particular documents, modifications to the request, etc.).

Step 14.  Add a document to S. The document x_i is the x_j in R for
          which C(x_j,S) is a maximum. (Based on current bias.)
          S = S ∪ x_i where C(x_i,S) ≥ C(x_j,S) for all x_j ⊂ R.

Step 15.  Add a document to S. The document x_i is the x_j in X for
          which the magnitude of the allowable new bias, |BIAS+b_j|,
          is a minimum.
          S = S ∪ x_i where |BIAS+b_i| ≤ |BIAS+b_j| for all x_j ⊂ X.

Step 16.  Change the bias if necessary. (Sign of b_i is modified by
          PHASE III to allow change in one direction only.)
          BIAS = (BIAS+b_i) where b_i is for the x_i added to S.


Fig. 5.5. Condition 2 and Additions.


[Flow chart boxes not reproduced; the comment printed beside each step
follows, with the box contents that survive.]

Step 17.  Check if all the documents in Y are positive to S.

Step 18.  Check if all the documents in Z are negative to S.

Step 19.  ANSWER CLUSTER. Termination of procedure. The answer
          cluster is S.

Phase III Bias Change

Step 20.  PHASE III = NOT YET? Check if this is the first time through
          Phase III.

Step 21.  Set PHASE III switch to allow bias to change in only one
          direction. (Surviving box text: PHASE III = DECREASE BIAS
          ONLY.)

Step 22.  BIAS = BIAS + Minimum (1,10/f). Make maximum change in bias.
          (The sign depends on the Phase III switch.)

Inconsistent Request

Step 23.  INTERACTION. The request is considered inconsistent since
          the bias must go up and down simultaneously. The user is
          informed of this fact and allowed to ask questions and/or
          modify the request.

Step 24.  Z = Z ∩ z̄_i. A document is chosen for deletion from Z if the
          user has not already modified the request.


Fig. 5.6. Phase III and other Tests for Request Documents
in Trouble
Procedure 2
This procedure is the same as Procedure 1 except that only one 
change is made to set A or set B at a time. Thus, the most posi- 
tively correlated document is added and then the most negative docu- 
ment is deleted from each set. 
Procedure 3 
The basic difference between this procedure and Procedure 2 is
that the criterion used to determine which document to add to set A
or B is that it be most positively related to the original request
instead of the current trial subset (S). Only those documents that
are positively correlated to S are considered for addition. Within 
this set, selection is on the basis of correlation to the original 
request. 
Procedure 4 
This procedure attempts to combine the advantages of Procedures
1 and 2. All documents positively correlated to either set A or B
(but not both) should be added to them on the first iteration as in 
Procedure 1. Subsequently only single changes are made to the sub- 
sets as in Procedure 2. 
Let us briefly note here why these earlier procedures were rejected. 
All of these procedures have a single subset B into which the documents 
considered not pertinent to the search are placed. This subset is 
treated just like the subset of pertinent documents and an attempt is 
made to form it into a cluster also. 
The difficulty with such an approach can be seen by the example of
Fig. 5.7. By the above procedures the non-pertinent set B is initialized
with Z=(z_1 z_2). Further additions to B are not possible because x_1
and x_2 are both negative to B. This is because the non-pertinent set is
really not one cluster but two clusters. Since x_1 and x_2 are negative
to B, one of them can be added to A. This will make x_3 and x_4 negative
to A and divert the procedure from the desired cluster. Basically what
has happened is that the usefulness of the documents in Z has been
hindered by requiring that they form a single cluster.


[Network diagram not reproduced.]

Links shown are +5

Links not shown are -6


Fig. 5.7. Example showing why non-pertinent documents
should not all be grouped into one cluster.

This would lead one to suggest that perhaps a separate cluster 
should be formed around each document in Z. There are some reasons why 
this would not prove useful in addition to the fact that it would eat up 
an excessive amount of effort in the formation of non-pertinent clusters. 
Consider the example of Fig. 5.8. Let us assume that x_3 is added to A
and x_5 to B on the first iteration. Now on the second iteration x_4 can
be added to A because it is no longer positive to B. The cluster
(x_1 x_2 y_1) is again not found because the non-pertinent cluster formed
around z_1 was (z_1 x_5 x_6) instead of (y_1 x_3 x_4 z_1). The point here is
that the z_i's will be in a number of clusters and one does not know exactly
which cluster to form around z_i in order to divert S in another direction.
[Network diagram not reproduced.]

Links shown are +5

Links not shown are -6

Y=(y_1)

Z=(z_1)

Desired cluster: (y_1 x_1 x_2)
Cluster to be excluded by z_1: (y_1 x_3 x_4 z_1)


Fig. 5.8. Example of difficulty with forming clusters
around non-pertinent documents.


5.5 Analysis of Procedure 

Thus far the clustering procedure selected has been described and 
flow charted and a brief explanation of the purpose of each block has
been given. Also certain earlier procedures have been briefly sketched. 
We shall now analyze the effectiveness of the selected procedure in 


terms of the objectives of Sec. 5.2. 


5.51 Request Satisfaction 


The procedure selected and most of the other procedures considered 
to date operate by making single changes to a set S which initially con- 
tains the Y set of the request. Documents not in S that are positively 
correlated to S are considered for addition to S and documents in S that
are negative to S are considered for deletion from S. Let us first 
settle the question of whether it is possible in general for a procedure 
of this type to locate an answer cluster if one exists. 

Theorem. It is always possible to transform a set S which
initially contains only the Y set of the request into a (subset)
answer cluster if one exists by successively adding to S
documents that are positively correlated to S.
Proof. The proof of this theorem will be constructive. 
(1) Initialize the set S with Y. 
(2) If S coincides with the answer cluster A, the procedure 
can terminate. 
(3) Otherwise, consider the set of documents (A ∩ S̄) yet to
be added to S to form A. By the definition of a subset cluster in
Sec. 4.2, (A ∩ S̄) must be positively correlated to S and thus there is
at least one document in (A ∩ S̄) that is positively correlated to S. Add
this document to S and go back to Step (2). QED
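
The three steps of the proof translate directly into a loop. The sketch below assumes that the answer cluster A is already known, which is exactly what a real procedure lacks, and that the correlation of a document to S is the sum of its links to the members of S. The network (listed pairs +5, all others -6) and the function names are hypothetical.

```python
SHOWN = {frozenset(p) for p in [
    ("y1", "x1"), ("x1", "x2"), ("y1", "x2"), ("x2", "x3"), ("x1", "x3"),
]}


def corr(d, S):
    return sum(5 if frozenset((d, s)) in SHOWN else -6 for s in S if s != d)


def grow_to_answer(Y, A):
    """Steps (1)-(3) of the proof: start from S = Y and keep adding a
    member of A not yet in S that is positively correlated to S."""
    S = set(Y)                                   # Step (1)
    order = []
    while S != A:                                # Step (2)
        positive = sorted(d for d in A - S if corr(d, S) > 0)
        if not positive:
            # Cannot happen for a subset cluster, by the theorem.
            raise ValueError("A cannot be reached from Y by additions")
        S.add(positive[0])                       # Step (3)
        order.append(positive[0])
    return order


print(grow_to_answer({"y1"}, {"y1", "x1", "x2", "x3"}))
```

Note that x3 is negative to the initial S = (y1) and becomes positive only after x1 and x2 have been added, so the order of additions matters even when A is known.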
Note that this theorem is true only for subset clusters. We can 
show that it does not hold for local maximum clusters by the example of 
Fig. 5.9. The set (y_1 y_2 x_1 x_2) forms a local maximum cluster, but it cannot
be reached from the set S_0 = (y_1 y_2) by the addition of documents positively
correlated to S.


[Network diagram not reproduced.]  Links not shown are -5


Fig. 5.9. Local maximum cluster not accessible to procedure.


Even when K ≤ -C_max the theorem still does not hold for local maximum
clusters. In the network of Fig. 5.10 the set (y_1 y_2 x_1 x_2) again forms
a local maximum cluster, but it cannot be reached from the set S_0 = (y_1 y_2)
by the addition of positively correlated documents.


[Network diagram not reproduced.]

Links shown are + (value illegible)
Links not shown are -5


Fig. 5.10. Local maximum cluster not accessible to procedure.
Actually it may be a distinct advantage if procedures of the type
being considered cannot reach certain local maximum clusters. It was
noted in Sec. 4.5 that a procedure which produces subset clusters only
would be preferred over one that results in local maximum clusters, but
that such a procedure had not been found. The above theorem and comments
show that procedures of the type selected can generate all of the
subset clusters which satisfy a given request. In
addition they may locate some (but not all) of the additional local
maximum clusters which satisfy the request.

Let us now observe that we have so far only proved that a suitable 
clustering procedure of the type suggested may exist. The 'constructive 
proof' of the theorem does not indicate how to choose the correct docu- 
ment to add to S in Step (3) if several documents are positive to S. 

One could, of course, try all possibilities. Let us represent these 
possible additions by a tree where each branch out of a node represents 
the addition of a positively correlated document to S. In the example of 
Fig. 5.11 there are three documents positively correlated to y_1, two
positively correlated to the set (y_1 x_1), etc.


S_0:  (y_1)

S_1:  (y_1 x_1)   (y_1 x_2)   (y_1 x_3)

S_2:  [sets on this level illegible]


Fig. 5.11. Possible additions to S.


A procedure which traversed all of the branches of such a tree 
would be assured by the preceding theorem of finding an answer (subset) 


cluster if one existed. However, one can quickly convince himself that 




such an exhaustive examination of all possible positively correlated 
additions is, in general, completely impractical because of the magni- 
tude of the task. What is needed is some way of determining which of 
the positively correlated documents should be added to S on each itera- 
tion. 
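
The size of such a tree can be made concrete with a small enumeration. The sketch below visits every distinct set that can be reached from S by single additions of positively correlated documents; it counts distinct sets, not the larger number of paths through the tree. It assumes a hypothetical network of one request document and ten others in which every link is +1, and a correlation equal to the sum of the links. All names are illustrative.

```python
def corr(d, S):
    # Hypothetical network: every pair of documents is linked by +1.
    return sum(1 for s in S if s != d)


def reachable_sets(S, docs, seen=None):
    """Collect every set reachable from S by adding, one at a time,
    a document that is positively correlated to the current set."""
    seen = set() if seen is None else seen
    key = frozenset(S)
    if key in seen:
        return seen             # this node was reached by another path
    seen.add(key)
    for d in docs - S:
        if corr(d, S) > 0:
            reachable_sets(S | {d}, docs, seen)
    return seen


docs = {"y1"} | {f"x{i}" for i in range(1, 11)}
print(len(reachable_sets({"y1"}, docs)))   # 2**10 sets for 10 candidates
```

Even with the duplicate nodes merged, ten candidate documents already give 1024 sets to examine, and the count doubles with each further candidate.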

There will, of course, be cases where the answer cluster is 
obtained no matter which of the positively correlated documents is added 
to S on a given iteration. A simple example of a request and network 
for which this is the case is given in Fig. 5.12. On the first iteration
one can add either x_1 or x_2 and still end up with the answer
cluster (y_1 y_2 x_1 x_2).


[Network diagram not reproduced.]  Links shown are +1


Fig. 5.12. Network where it does not matter which document 
is added to S first. 

However, in the more general case the choice of which document to 
add to S on each iteration is a very critical aspect of the clustering 
procedure. The answer to a request may not even be found if the wrong 
document is added to S on one or more of the iterations. As an example, 
consider the network and request of Fig. 5.7. If the procedure were to
add x_1 to S on the first iteration, then (y_1 x_3 x_4), the only cluster
which satisfies the request, would not be found.

Let us now describe the criteria used by the procedure of Sec. 5.3 
to decide which document to add to S on each iteration and note how 
these criteria might help in obtaining an answer cluster if one exists. 


In Steps 9-11 of Fig. 5.5 preference is given to documents that are
positively linked to each y_i (or else leave the y_i positive to S) and
negatively linked to each z_i (or else leave the z_i negative to S). The
network of Fig. 5.7 serves as an example of how this preference might
aid in obtaining the answer cluster. Documents x_3 and x_4 are considered
for addition to S before x_1 and x_2 and the answer cluster (y_1 x_3 x_4) is
obtained.

Steps 12 and 15 of Fig. 5.5 are for the purpose of minimizing the 
bias on each iteration and will be discussed when we talk about request 
modification and ambiguity. 

In Step 14 the document which is selected for addition to S is the
one that has the highest positive correlation to S from among those docu- 
ments that have met all of the earlier criteria. 

The theorem at the beginning of this section shows that the only 
operation that a procedure needs to perform is the addition of positively 
correlated documents to S if the appropriate document to be added on 
each iteration can be determined. If, in fact, the procedure mistakenly 
adds on a given iteration a document which is not part of the answer, 
then it may still be possible to arrive at the answer if the procedure 
is allowed to also delete documents that have become negatively correlated
to S (Steps 5-7 of Fig. 5.4). In the network of Fig. 5.13 the
answer S_4 = (y_1 y_2 x_1 x_2) is obtained even though S_1 = (y_1 y_2 x_3).


[Network diagram not reproduced.]

Links shown are +4

Links not shown are -5


Fig. 5.13. Network showing that the procedure must be
allowed to delete as well as add.
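
The interplay of additions and deletions can be followed in a small sketch at zero bias. The network below is hypothetical (it is not the network of Fig. 5.13, whose diagram is lost): listed links carry the values shown and all other links are -6. The sketch assumes that correlation is the sum of the links, that the most negative document is the one deleted, and that ties are broken by name; an iteration cap guards against a non-terminating cycle.

```python
LINKS = {frozenset(p): v for p, v in [
    (("y1", "y2"), 4), (("y1", "x3"), 5), (("y2", "x3"), 5),
    (("y1", "x1"), 4), (("y2", "x1"), 4),
    (("y1", "x2"), 4), (("y2", "x2"), 4), (("x1", "x2"), 4),
]}
DOCS = {"y1", "y2", "x1", "x2", "x3"}


def corr(d, S):
    return sum(LINKS.get(frozenset((d, s)), -6) for s in S if s != d)


def cluster(Y, max_iterations=100):
    """Single-change procedure: delete a member that has turned negative
    (members of Y excluded), otherwise add the most positive outsider."""
    S = set(Y)
    history = [sorted(S)]
    for _ in range(max_iterations):
        negative = [d for d in sorted(S - Y) if corr(d, S) < 0]
        if negative:
            S.remove(min(negative, key=lambda d: corr(d, S)))
        else:
            positive = [d for d in sorted(DOCS - S) if corr(d, S) > 0]
            if not positive:
                return history          # both cluster conditions hold
            S.add(max(positive, key=lambda d: corr(d, S)))
        history.append(sorted(S))
    raise RuntimeError("no convergence within max_iterations")


for step, members in enumerate(cluster({"y1", "y2"})):
    print(step, members)
```

Here x3 is added first because it has the highest correlation to (y1 y2), becomes negative once x1 and x2 have joined, and is deleted on the fourth iteration.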


Despite the above features which help in the choice of the docu- 


ment to be added on each iteration, there are still cases where the 
procedure of Sec. 5.3 does not find an answer cluster even when one
exists. Consider the request and network of Fig. 5.14. Documents x_1,
x_2, and x_3 are linked to the documents in sets Y and Z by exactly the
same values and are all candidates for addition to S on the first iteration.
If the first document to be added is either x_1 or x_2, then the
procedure finds the cluster (y_1 y_2 x_1 x_2) which is the only valid answer
cluster for the request. If, however, x_3 is added to S first, then the
procedure reaches a point where no bias can be chosen which will simultaneously
keep y_1 and y_2 positive to S and z_1 negative to S and the request
is judged inconsistent.


[Network diagram not reproduced.]

Links shown are +4 unless otherwise indicated.

Links not shown are -5.

Only valid answer cluster = (y_1 y_2 x_1 x_2)


Fig. 5.14. Network illustrating the difficulties involved
in knowing which document to add to S on a
given iteration.

The alternatives open to the procedure for the network of Fig. 5.14
are shown in the decision tree of Fig. 5.15. It should be pointed out
that all of the procedures discussed in this chapter decide which docu- 
ment to add to S on each iteration on the basis of the relatedness of 
the document being considered to the documents in the S, Y, and Z sets 
only. The inter-relatedness of the documents not in S, Y, and Z is not 
a factor in the selection. Indeed, from a practical standpoint, it cannot
be used as a factor in the decision, since it would necessitate
considering the consequences of adding subsets of documents instead of
single documents and for r documents under consideration there are as
many as 2^r subsets to consider.


S_0:  (y_1 y_2)

S_1:  (y_1 y_2 x_1)   (y_1 y_2 x_2)   (y_1 y_2 x_3)

[Lower levels of the tree are illegible; the two deepest sets shown
are each marked "Inconsistent".]


Fig. 5.15. Tree illustrating the possible additions to
S for the network and request of Fig. 5.14.

If the documents to be added to S are chosen on the basis of their
relatedness to the S, Y, and Z sets only, then there is no way of determining
whether to add x_1, x_2, or x_3 to S_0 in Fig. 5.14. If one cannot
tell beforehand whether to add x_1, x_2, or x_3, then perhaps a procedure
should be devised that would at some later point back up and try another
'direction' if S becomes inconsistent with the request. In other words,
if x_3 is added to S in Fig. 5.14, perhaps one could on the fourth iteration
remove a subset containing x_3 from S and add x_1 and x_2. Such a
step would require not only that the procedure be able to know which 
subset to remove but also that it remember all of the previous S sets 
so that it would not fall into a non-terminating cycle. This approach 
is also rejected as not being practical. 

The philosophy adopted for this research project is that for those
cases where the procedure has difficulty in locating an answer, the
user should be coupled into the procedure to guide the process in the
right direction. This is the reason for the interaction points in the
procedure. The user can step in before the addition or deletion of any
document and over-ride the decision of the procedure by changing the
request, if he decides the cluster is moving into the wrong area. In
the case of Fig. 5.14 the user could easily obtain the cluster (y_1 y_2 x_1 x_2)
by specifying any member of the set (x_3 x_4 x_5 x_6 x_7) to be uninteresting.


5.52 Request Modification 

If the request as initially specified by the user is inconsistent 
or ambiguous, then some additional interplay may be needed between the 
system and the user so that it can be appropriately modified. Let us 
make some general comments about the suitability of the clustering pro- 
cedure for interaction with a user and then deal specifically with the 
problem of what particular type of interaction is needed to resolve 
request inconsistency and ambiguity. 

If a clustering procedure is to be used in close coupling with the 
user, then the process should be divisible into small units of effort. 
Each unit of effort should produce some useful piece of information that 
can be presented to the user and the user should be able to make changes 
to the request between these units of effort. 

The natural unit of effort is, of course, the iteration. The 
information produced by the iteration is the document to be added to or 
deleted from S. The change in the request can be the response of the 
user to the document presented. An iterative clustering procedure, 
therefore, lends itself very well to close supervision by the user. 

There are four interaction points shown for the procedure of 
Sec. 5.3. The initial specification of the request is made at Step 1. 
In Step 6, which immediately precedes the deletion of a document from S 
(Step 7), the user is given a chance to examine the document to be 
deleted and to modify his request if he wishes to. In Step 13 the user 
is allowed to ask questions and change the request before the addition 
of a document to S. In Step 23 the request is judged inconsistent and 
the user is again allowed to obtain information from the system and 
modify the request. These four steps provide an interaction point before 
each change to S and on each iteration of the procedure. A description 
of the full range of questions that can be asked by the user at these 
interaction points will be given when the retrieval language is presented 
in Chapter VIII. 

Let us now consider the problem of determining whether a request is 
inconsistent or ambiguous. One test for inconsistency has already been 
given. The last theorem of Sec. 4.5 states that in order for two 
negatively correlated documents to be in the same cluster they must be 
positively linked to at least one common document (if K ≤ -C ). Let us 
present three more theorems pertaining to whether two documents are 
assured of being in a cluster together or not. 


Theorem. Two documents x_1 and x_2 can be positively correlated 
to exactly the same documents and negatively correlated to the 
same documents and still not be in the same clusters. 

Proof. Consider the example of Fig. 5.16. The documents x_1 and x_2 
are both positively correlated to x_3 and x_4 and negatively correlated 
to x_5. However, (x_1 x_3 x_4 x_5) forms a cluster which contains x_1 and 
excludes x_2. The link between x_1 and x_2 is dotted to show that they 
can be positively or negatively linked and the theorem would still be 
true. QED 

Fig. 5.16. Network with x_1 and x_2 not in the same cluster. 


Theorem. A document x_1 can be positively correlated to every 
document that a document x_2 is negatively correlated to (and vice 
versa) and x_1 and x_2 can still be in a cluster together. 

Proof. The networks in Fig. 5.17 offer a proof of this theorem. 
The documents x_1 and x_2 are in the same cluster (x_1 x_2 x_3 x_4) and 
yet the values of their links to x_3 and x_4 have the opposite signs. 
QED 

Fig. 5.17. Network with x_1 and x_2 in the same cluster. 


If one adds the restriction that K ≤ -C , then the above theorem 
is only true for positively correlated document pairs. The last theorem 
of Sec. 4.5 states that when K ≤ -C two negatively correlated 
documents can occur in a cluster together only if they are positively 
linked to one or more of the same documents. 
Theorem. Two documents x_1 and x_2 are assured of always 
being in the same clusters together if C(x_1 x_2) is greater than 
the absolute magnitude of the difference in the correlations 
of x_1 and x_2 to every possible subset of other documents. 
Proof. To prove this theorem let us assume that x_1 and x_2 are not 
in the same cluster and then show a contradiction. Let us say that x_1 
forms a cluster with the set of documents A which does not include x_2, 
as indicated in Fig. 5.18. 

Fig. 5.18. Network for proof of theorem. 

Since x_1 ∪ A is a cluster: 

\[ C(x_1 A) > 0 \]
\[ C[(x_2)(A \cup x_1)] \le 0 \]

Rearranging and combining these inequalities-- 

\[ C(x_2 A) + C(x_1 x_2) \le 0 \]
\[ C(x_1 x_2) \le -C(x_2 A) \]
\[ C(x_1 x_2) \le C(x_1 A) - C(x_2 A) \]
\[ C(x_1 x_2) \le |C(x_1 A) - C(x_2 A)| \]

This last inequality is in conflict with the part of the theorem 
which states that for any A: 

\[ C(x_1 x_2) > |C(x_1 A) - C(x_2 A)| \]


These three theorems give some indication of the difficulties 
involved in determining if two documents are in the same cluster on the 
basis of the links from those documents to the other documents of the 
network. The third theorem here and the last theorem of Sec. 4.5 would 
help in some cases to determine whether documents can co-occur in 
clusters, but they have far from general applicability. 

It was, therefore, concluded that there was no easy test which 
could be initially performed to determine if the request was inconsis- 
tent or ambiguous. The tests which were devised consisted of attempts 
to find one or more clusters which satisfied the request and required at 
least as much effort as the finding of an answer for a valid request. 

It was decided that the procedure should not concern itself with the 
problems of request ambiguity and consistency at first but should assume 
that the request is valid and start trying to find the answer cluster. 
If during this process it was decided that the request was inconsistent, 
then the user would be notified of this fact. And if the user was still 
worried about ambiguity after a cluster had been found, then he could 
perform some further searching to satisfy himself that he had retrieved 
what he was after. 

It was further decided that the user should be given the option of 
being able to interact with the procedure on any or all of the itera- 
tions in order to monitor what was being retrieved and in order to 
modify the request if the situation demanded it. Thus a user who sus- 
pected his request to be ambiguous or inconsistent could carefully watch 
what documents were being added to S to make sure that he was obtaining 
what he wanted, while the user who had confidence in the validity of his 
request could let the procedure run to completion unattended. 

The rule which was followed in the design of the procedure of 
Sec. 5.3 was, therefore, to allow the user to interact at any point he 
wished to (and especially in cases where an invalid request was 
suspected), but never to require that he respond before the clustering 
could continue. Thus in Steps 23 and 24 of Fig. 5.6 the request appears 
to be inconsistent. The user is given the chance of changing his 
request if he wishes. If no change is made, then the procedure picks a 
document to be deleted from Z so that clustering can continue. 

Also in the case of ambiguity the procedure is designed to find the 
most reasonable answer cluster it can for presentation and not to depend 
on the user to clear up the ambiguity. This is the purpose of Steps 12 
and 15 in Fig. 5.5. If two clusters with different biases are both 
valid answers to the request, then the one with the smaller bias is 
considered a better selection. Therefore, an attempt is made to make 


the bias as small as possible on each iteration. 


5.53 Convergence 

A major objective in the design of the clustering procedure is to 
insure that it will always terminate in a finite number of steps for 
every possible document network and every possible request. A procedure 
which occasionally drops into an infinite loop would, of course, be 
completely unacceptable. The possibility of an infinite loop comes 
about because of the fact that the procedure can delete as well as add 
documents to the set S. If on some iteration the set S has the same 
composition as it had on a previous iteration, and if the procedure 
does not remember all of the previous S sets, then a non-terminating 
cyclic behavior is possible. 

In Phase I of the procedure convergence is assured by the following 
theorem. 

Theorem. A procedure is convergent if the only types of 
changes made to the set S being formed are the addition of 
documents positively correlated to S and the deletion of 
documents negatively correlated to S. 

Proof. The internal correlation of S is increased by the addition 
of a document positive to S. It is also increased by the deletion of a 
document negative to S. Thus C(S) increases monotonically as these two 
types of changes are made to S. This means that C(S) is larger on a 
given iteration than for any earlier iteration. Therefore the 
composition of S must be different on each iteration. Since there are 
at most 2^n possible S sets (for a network of n documents), there are at 
most 2^n iterations of the procedure before it terminates. QED 
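
The argument of the proof can be illustrated with a minimal sketch. 
It is not the procedure of Sec. 5.3; it assumes, for illustration only, 
that the correlation of a document to a set is the sum of its links to 
the members of the set, that C(S) is the sum of the links over all 
pairs in S, and that unlisted links are zero. The names form_set, 
corr_to_set and internal_corr and the four-document network are 
hypothetical. 

```python
from itertools import combinations


def link(links, a, b):
    """Link value between two documents; unlisted pairs count as 0 (assumption)."""
    return links.get(frozenset((a, b)), 0.0)


def corr_to_set(x, S, links):
    """Correlation of x to S, assumed here to be the sum of its links into S."""
    return sum(link(links, x, y) for y in S if y != x)


def internal_corr(S, links):
    """C(S), assumed here to be the sum of the links over all pairs in S."""
    return sum(link(links, a, b) for a, b in combinations(sorted(S), 2))


def form_set(start, docs, links):
    """Only delete documents negative to S or add documents positive to S."""
    S = set(start)
    history = [internal_corr(S, links)]
    while True:
        negative = [x for x in sorted(S) if corr_to_set(x, S, links) < 0]
        positive = [x for x in sorted(docs)
                    if x not in S and corr_to_set(x, S, links) > 0]
        if negative:
            S.remove(min(negative, key=lambda x: corr_to_set(x, S, links)))
        elif positive:
            S.add(max(positive, key=lambda x: corr_to_set(x, S, links)))
        else:
            return S, history
        history.append(internal_corr(S, links))


# Hypothetical four-document network.
links = {
    frozenset(("a", "b")): 3.0, frozenset(("a", "c")): 2.0,
    frozenset(("b", "c")): 1.0, frozenset(("a", "d")): -4.0,
    frozenset(("b", "d")): -1.0, frozenset(("c", "d")): -2.0,
}
S, history = form_set({"a", "b", "d"}, "abcd", links)
print(sorted(S), history)
```

Each entry of history is the value of C(S) after one change. Because 
every change strictly increases C(S), no composition of S can recur, 
which is the reason the loop must terminate. 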

If the bias of the network is changed as it is in Phase II, then 
the above theorem no longer insures convergence. For example, the 
following steps might possibly be taken by a hypothetical procedure in 


trying to obtain a cluster in the network of Fig. 5.19. 


Links not shown are -6 


Fig. 5.19. Network which may cause a procedure to cycle. 


(1) S5=(y,) 


(2) s=(y,x) 0(x,8,)95 
(3) 8,*(y,x,%5) C(x,S, )=10 

(4) Bias =-2 to keep z, negative 

(5) 83=(y,*)%>x3) C(x,S, )=1 

(6) Bias =-3 to keep 24 negative 

(8) Bias =-2 to just keep 2, negative 


At this point the procedure returns to Step (5) in a never ending 
loop. 

In order to avoid such cycles Phase II of the procedure selected 
(Sec. 5.3) synchronizes each change in bias with the addition of a 
document to S. If the document being added increases the internal 
correlation of S by k bits, then a decrease in bias is allowed which 
decreases the internal correlation by up to k bits. Thus the total 
internal correlation of S is still increased on each iteration and 
convergence is again assured. 

In the above example Phase II would combine (synchronize) Steps (3) 
and (4) and allow the bias to still be -2 bits. Steps (5) and (6) would 
also be combined but the bias would only be allowed to go to -2.2 bits 
(b_3 = C(x_3,S)/5). Step (7) would not be taken because x_3 would not be 
negative [C(x_3,S) = 0.6]. 

Thus far we have talked about the effect of decreasing the bias 
on convergence. An increase in bias does not reduce the total internal 
correlation and would not necessarily have to be synchronized with 
additions to the set. For purposes of symmetry, however, bias increases 
are placed under the same restrictions that bias decreases are. 

Finally, let us consider convergence in Phase III. Bias changes 
that are not synchronized with the addition of a document are now 
allowed, but the bias can change in only one direction. We have already 
shown that the clustering procedure is limited to a finite number of 
iterations for a given bias (by the above theorem). Phase III permits 
only a finite number of bias changes so the total number of iterations 


is finite and we are assured of convergence once more. 


5.54 Minimum Number of Iterations 

Those steps which are taken to improve the proper selection of the 
document to be added on each iteration should also help to decrease the 
number of deletions necessary on later iterations. We have already 
discussed the problem of choosing the correct document on a given 
iteration. 
PART THREE: EXPERIMENTAL SYSTEM 


In the last three chapters the basic components 
of the theoretical model were presented. The next 
three chapters describe the experimental system which 
was developed so that the model could be tested in a 
realistic environment. 

The four aspects of the experimental system 
that will be covered are: 

Chapter VI: Computational Facilities and Data Base 

Chapter VII: File Structure 

Chapter VIII: Interaction Language 
CHAPTER VI 


COMPUTATIONAL FACILITIES AND DATA BASE 


There are two projects at M.I.T. on which this research endeavor is 
highly dependent. Project MAC supplied the computational facilities for 
the experimental phase of the project. The Technical Information Project 
supplied the document collection and data base on which the experiments 
were performed. In addition these two projects provided considerable 
other technical and general assistance. Since the computational 
facilities and data base are essential components of the experimental 
system, they will now be described. 


6.1 Computational Facilities 
The experimental portion of this project was designed for the 
Project MAC time-sharing system. In this section we shall describe 
the MAC system and note some of its features that are of particular 
significance to this project. A more complete description of the 
objectives and characteristics of the MAC system can be found in the 
references [12, 21]. 

Fig. 6.1 is an abbreviated diagram of the equipment included in 
the MAC system. Some of the more significant parameters of this equip- 
ment are given in Fig. 6.2. All of the equipment shown in Fig. 6.1 is 
physically located at M.I.T.'s Technology Square with the exception of 
the time-sharing consoles. Over 100 of these consoles are located at 
various places on the M.I.T. campus and can be connected to the 7750 
through the M.I.T. telephone exchange. There are also MAC consoles at 
more remote locations. Indeed any TWX or TELEX telegraph station has 
the capability of being connected into the MAC system. Each console 
has a dual purpose. It communicates to the 7750 what characters have 
been typed on its keyboard and it also types out messages originating 
in the 7094 that are routed to it through the 7750. 

In a time-shared computer a number of consoles can be simultaneously 
connected into the system and can independently obtain the services of 
the central processor. A limit is normally placed on the number of 
consoles that can be actively connected at any one time. The purpose of 
this limit is to help insure that those who are connected will be 
promptly serviced. The current limit for the MAC system is 30, but it 
varies periodically as changes and improvements are made in the system. 

One of the core storage banks (bank A) contains the time-sharing 
supervisory program. This program decides which of the users who 
currently want service has the highest priority. The program of the 
highest priority user is loaded into core (bank B) from the disc or 
drum and allowed to run for up to two or three seconds. Then the 
program is removed (swapped) and the new highest priority program is 
loaded and run. 

The IBM 1302 disc is used for permanent or temporary storage of 
programs and data. The data file to be described in the next section 
is stored on this disc as well as programs which arrange and structure 


it and allow the user to communicate with it. 
[Figure: modified IBM 7094 central processor; transmission control 
unit; time-sharing consoles (IBM 1050's, Model 35 Teletypes, etc.)] 

Fig. 6.1. Project MAC Equipment Configuration. 


Basic word size                          36 bits 
Core storage operating cycle             2 microseconds 
  (to read or write 1 word) 
Size of core storage banks A and B       32,768 words each 
1302 disc storage capacity               34.56 million words 
                                         (80,000 tracks of 432 words each) 
1302 disc scan time                      50-180 milliseconds to position 
                                         on track; 50 milliseconds to 
                                         read track 
Transmission rate to and from            about 100 bits/second 
  time-sharing consoles 
Physical limit on number of consoles     112 
  connected to 7750 
  (the actual limit is lower) 

Fig. 6.2. Significant Parameters of MAC System. 
6.2 Data Base 

The basic data needed to implement the theoretical model of Part 
Two is a document collection and a file of partitionings of that 
collection. The document collection selected is described in the next 
section and the final section of the chapter contains a discussion of 


the type of partitioning data that will be used. 


6.21 Document Collection 

The Technical Information Project at M.I.T. is currently accumu- 
lating a file of information on articles found in the physics periodical 
literature? This file covers about 26,000 articles from 25 different 
journals. Fig. 6.3 lists the names of the journals and the extent of the 
coverage in terms of volumes. The time period covered for each journal 
is 1 Jan. 1963 to the present. Note that all of the articles in the 
volumes listed are included. 

One can gain some appreciation of the extent of the coverage of the 
file by noting that the 25 journals account for over 50% of the articles 
that are abstracted for Physics Abstracts. 

The file is currently growing at the rate of 1500 articles a month. 
Periodically new journals are added to the file. Journals to be included 
are selected on the basis of a statistical analysis of their citations. 
This selection criterion is described more fully elsewhere. 

The information extracted for each article is the journal identifi- 
cation, volume and page number, title, author(s), author location(s), 
and coded bibliographic citations. Fig. 6.4 is an example of the infor- 
mation available in a given article. Fig. 6.5 summarizes some of the 


parameters of the file. 
                                                   Journal  Volume   Number of 
     Journal                                       Code     Range    Articles 
 1.  Annals of Physics                             384      21-36       275 
 2.  Applied Physics Letters                       646      2-8         592 
 3.  Canadian Journal of Physics                   55       41-44       531 
 4.  Helvetica Physica Acta                        3        36-38       202 
 5.  Indian Journal of Physics                     164      37-39       165 
 6.  Japanese Journal of Applied Physics           612      2-4         328 
 7.  JETP Letters                                  821      1-2          65 
 8.  Journal of Applied Physics                    11       34-37      1643 
 9.  Journal of Chemical Physics                   12       38-44      3398 
10.  Journal of Mathematical Physics               227      6           193 
11.  Journal of the Physical Society of Japan      80       18-20       759 
12.  Nuovo Cimento                                 17       27-40      1385 
13.  Nuclear Physics                               682      46-75      1529 
14.  Physica                                       21       29-31       359 
15.  Physical Review                               1        129-142    3713 
16.  Physical Review (Series B)                    199      133-140    1791 
17.  Physical Review Letters                       41       10-16      1585 
18.  Physics Letters                               9        3-20       2880 
19.  Physics of Fluids                             799      6-8         607 
20.  Proceedings of the Physical Society (London)  3        81-87       738 
21.  Progress of Theoretical Physics (Kyoto)       29       29-34       392 
22.  Soviet Journal of Nuclear Physics             825      1           144 
23.  Soviet Physics - JETP                         669      16-21      1485 
24.  Soviet Physics - Solid State                  310      5-7         814 
25.  Soviet Physics - Technical Physics            790      6-10        898 
     Total (volumes, articles)                              178      26,471 


Fig. 6.3. Journals covered by the physics periodical file 
of the Technical Information Project (March 20, 1966). 
Physical Review 
Volume 136 
Page: 0001 
Spectral properties of a single-mode ruby laser. Evidence of 
homogeneous broadening of the zero-phonon lines in solids 
Tang, C. L. 
Statz, H. 
Demars, G. A. 
Wilson, D. T. 
Waltham, Massachusetts 
Raytheon Research Division 
JOO V102 P1252 JOO1 V112 P19l0 JOO] V128 P1726 
JOO1 V133 P1029 = JO11l VO3 P1682 JO11 VO34 P2289 
JO11 VO3L P2935 JO18 V187 POlo3 JO18 V195 P0587 
JO41 VOO6 PO106 =: JOWS ~-VOO9 PO399 =: J6L4S_-« VOO2._- P0222 


Search completed, 257 articles. 
1.99 seconds, 129.1 articles/sec. 


Fig. 6.4. Example of the information available on a given 
article. The last four lines are the coded 
citations (J=journal, V=volume, P=page). 


Number of articles available on the disc        26,471 
Time span covered                               Jan. 1963 to present 
Files key-punched but not currently on the disc: 
    (1) Physical Review, Vol. 77-128 (1950-1962) 
    (2) Journal of Chemical Physics, Vol. 28-37 (1958-1962) 
Average number of articles per track            6.7 
Average number of authors per article           2.02 
Average number of citations per article         12. 
Average number of words per title               8. 


Fig. 6.5. Parameters of T.I.P. data file (March 20, 1966). 
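
The figures quoted in Figs. 6.2, 6.4 and 6.5 can be cross-checked with 
a few lines of arithmetic. The sketch below uses only numbers that 
appear in those figures. The estimate of the number of tracks occupied 
assumes that the average of 6.7 articles per track applies uniformly to 
the whole file, which the text does not state. 

```python
# Disc capacity from Fig. 6.2: 80,000 tracks of 432 words each.
TRACKS = 80_000
WORDS_PER_TRACK = 432
capacity_words = TRACKS * WORDS_PER_TRACK

# File size from Fig. 6.5 (uniform packing is an assumption).
ARTICLES = 26_471
ARTICLES_PER_TRACK = 6.7
tracks_used = ARTICLES / ARTICLES_PER_TRACK
disc_fraction = tracks_used / TRACKS

# Search rate from the console output in Fig. 6.4.
search_rate = 257 / 1.99

print(f"capacity: {capacity_words / 1e6:.2f} million words")
print(f"tracks used: about {tracks_used:.0f} ({disc_fraction:.1%} of the disc)")
print(f"search rate: {search_rate:.1f} articles/sec")
```

Under that assumption the file occupies roughly one twentieth of the 
disc, and the computed search rate agrees with the 129.1 articles/sec 
shown in Fig. 6.4. 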
Initially the information is key-punched on IBM cards. After some 
preliminary editing and correction it is then loaded on the IBM 1302 disc 
of the Project MAC computer. On the disc it undergoes more editing and 
is transformed into the format selected for permanent storage (see 
Sec. 7.1). 

The T.I.P. file has certain features which make it attractive for 
use by this research project. It is of sufficient size and interest to 
attract serious users. The articles covered contain a substantial 
number of citations which will be shown to be of particular use shortly. 
The generation of the data involves only clerical and mechanical 
operations (i.e. no human indexing or evaluation is required). 


6.22 Partitions 

Some of the advantages to having a retrieval system based on user 
feedback were discussed in Chapter II. A basic objective of this 
project was stated to be the investigation of the feasibility of such a 
system. In Chapter III a particular form that user feedback could take 
was described. Basically it consisted of each interaction of a user 
with the document collection resulting in a partitioning of the docu- 
ments into a set of interesting documents and a set of uninteresting 
documents. 

This type of interaction was described so that one could better 
understand the motivation behind the choice of the sample space, 
probabilities, and other aspects of the theoretical model. Actually the 
theoretical model as developed in Chapters III, IV, and V in no way 
requires that the partitionings on which the probability estimates are 


based be generated by user interactions. Any type of partitioning data 


could be used, even data that has been arbitrarily contrived. Indeed, 
in the experimental system another type of partitioning was used because 
usage data is not readily available at the present time. 

Let us consider whether a change in the type of partitioning data 
employed by the experimental system will impair its effectiveness in 
testing whether a system based on usage data is feasible. First it can 
be observed that much of this investigation has very little, if any, 
dependence on the particular type of data being utilized. For example, 
the objective of a procedure of Chapter V is to find a cluster of 
documents. Its ability to do this could be examined and tested as well 
on a set of arbitrarily selected partitionings of a hypothetical 
document collection as on a set of partitionings generated by the 
interaction of a real user population with a real library. 

There are some reasons, however, why it is advisable to use a set 
of partitionings for the experimental system that is not artificial and 
which resembles usage data as closely as possible. For example, the 
utility of the interaction points in the procedure is best tested by 
real users. This, of course, requires a data base which produces 
results that a user would be interested in. Also the overall 
effectiveness of the system in producing useful results can be properly 
evaluated only in a realistic environment. 

With this objective in mind let us now consider what types of 
partitionings are available for the document collection described in the 
last section. There were five types of partitionings that were 
evaluated for this project. They consist of dividing the set of docu- 
ments into two subsets based on whether or not the documents-- 


(1) were written by a given author. 

(2) contain a certain word in their titles. 

(3) cite a given article. 

(4) were cited by a given article. 

(5) occur in a given subject category. 

Thus by criterion (1) there are as many partitions as there are authors 
in the file, with each author dividing the document file into those 
papers he wrote and those he didn't write. 

A detailed analysis of each of the above types of partitionings was 
conducted on one volume (vol. 128) of the Physical Review. Certain 
tests were also conducted on much larger parts of the document collection. 
Let us summarize the results of these tests and evaluate each of the five 
partitioning criteria. 

(1) Author Partitions. 

Difficulty was encountered in devising an algorithm that could 
determine if two author names referred to the same individual. A sur- 
prisingly large number of the authors were not consistent in the way 
they gave their names. Given names were sometimes supplied in full, 
sometimes represented by an initial, and sometimes left off altogether. 
The method which yielded the best results required an exact match of the 
surname and required that given names either match exactly or match on 
the first letter if one of the names was a single letter (i.e. an initial). 
We at first allowed a missing given name to be a match for anything, but 
this produced too many false matches. We, therefore, required that in 
order for a match to occur the number of given names had to coincide. 
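
The matching rule just described can be sketched as follows. This is 
an illustration, not the program used for the tests. It assumes names 
are written as "Surname, Given Given" in the style of Fig. 6.4, and 
the function names parse_name, given_match and same_author, as well as 
the spelled-out given names in the example, are hypothetical. 

```python
def parse_name(name):
    """Split 'Surname, G. H.' into (surname, [given names without periods])."""
    surname, _, given = name.partition(",")
    parts = [g.strip(".") for g in given.split()]
    return surname.strip(), [p for p in parts if p]


def given_match(a, b):
    """Exact match, or first-letter match when either name is an initial."""
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b


def same_author(name_a, name_b):
    """Apply the rule of the text: surnames must match exactly, the number
    of given names must coincide, and each given name must match."""
    surname_a, given_a = parse_name(name_a)
    surname_b, given_b = parse_name(name_b)
    if surname_a != surname_b:
        return False
    if len(given_a) != len(given_b):
        return False
    return all(given_match(a, b) for a, b in zip(given_a, given_b))


print(same_author("Tang, C. L.", "Tang, Chung L."))   # initial vs full name
print(same_author("Tang, C. L.", "Tang, C."))         # different number of names
print(same_author("Wilson, D. T.", "Wilson, D. R."))  # conflicting initials
```

The second call shows the effect of the final restriction: a missing 
given name is no longer treated as a match for anything. 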

Another difficulty was that roughly half of the authors were the 
authors of only one paper. This produced a large number of partitionings 
with only one document in the subset of "interest", with the consequence 
that there were many of the papers that did not co-occur with any other 
paper by this method. 

A third drawback to this type of partitioning arises in those cases 
where an author changes his area of interest and publishes articles on 
unrelated subjects. 

(2) Word Partitions. 

If every title word is allowed to create a partition of the file, 
then practically every document will co-occur with every other document 
because of the common function words like "of", "the", etc. The 
alternative is to try to identify function words and exclude them from 
use. However, there is no clear distinction between function words and 
keywords. It is fairly clear that certain words should be eliminated if 
co-occurrences are to be meaningful. However, there is a large grey 
area of words such as "effect", "wave", "theory", or "electronic" that 
in and of themselves create little meaningful linkage, but in 
combination with other words are very significant. The approach adopted 
for the tests was to eliminate all words that occurred in over 5-10% of 
the titles. This unfortunately eliminated the word "nuclear" while 
allowing words like "between" and "theory" to create partitions. 

A second problem in using word partitions is that there are a 
number of words which differ from each other by only a suffix (i.e. 
superconductor, superconductors, superconducting, superconductive, 
superconductivity). A table was compiled of O of the more commonly 
occurring suffixes of the title words in the document file. All of the 
words which differed from each other by one of these suffixes were 
considered equivalent in creating partitionings. 
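
The two steps described above, dropping words that occur in too many 
titles and treating words that differ only by a suffix as equivalent, 
can be sketched as follows. The suffix table, the minimum stem length, 
the function names stem and word_partitions, and the sample titles are 
all hypothetical; the table actually compiled for the tests is not 
reproduced in the text. 

```python
from collections import defaultdict

# Hypothetical suffix table, longest suffixes first.
SUFFIXES = ("ivity", "ive", "ing", "ors", "or", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least four letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def word_partitions(titles, max_fraction=0.10):
    """Map each stem to the documents whose titles contain it, dropping
    stems that occur in more than max_fraction of the titles."""
    groups = defaultdict(set)
    for doc_id, title in titles.items():
        for word in title.lower().split():
            groups[stem(word.strip(".,;:()"))].add(doc_id)
    limit = max_fraction * len(titles)
    return {w: docs for w, docs in groups.items() if w and len(docs) <= limit}


titles = {
    1: "Theory of superconductivity",
    2: "Superconducting films of tin",
    3: "Theory of nuclear forces",
    4: "Scattering of neutrons",
}
# With only four sample titles a 50% cutoff is used instead of 5-10%.
parts = word_partitions(titles, max_fraction=0.5)
print(sorted(parts))
```

In this sample "of" is dropped because it occurs in every title, while 
documents 1 and 2 co-occur under the common stem of "superconductivity" 
and "superconducting". 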
An even more basic problem involves the use of synonymous words for 
the same concept. Some type of thesaurus would be necessary to link up 
articles with synonymous title words. It was decided that there are too 
many problems involved in the generation (or selection) and use of a 
thesaurus to warrant any effort in this direction in this research 
endeavor. 

(3) Cite-same Partitions. 

When two papers cite one or more of the same papers they are said to 
be bibliographically coupled. A number of studies have been conducted 
to analyze the characteristics of bibliographic coupling 28. These 
studies indicate that bibliographic coupling constitutes a very meaning- 
ful and important type of relationship between papers, especially in 
those document collections which have a sizable amount of citation infor- 
mation. In the T.I.P. file of Sec. 6.21 there are an average of 12 
citations per article and strict editorial policies make it easy to 
identify the articles that are cited. 
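
A minimal sketch of how cite-same partitions might be derived from 
coded citations is given below. Each cited article defines one 
partition, whose subset of interest is the set of documents citing it. 
The function names and the sample data are hypothetical, and counting 
shared citations per document pair is shown only as one simple way of 
summarizing the co-occurrences. 

```python
from collections import defaultdict
from itertools import combinations


def cite_same_partitions(citations):
    """Map each cited article to the set of documents that cite it."""
    partitions = defaultdict(set)
    for doc, cited in citations.items():
        for ref in cited:
            partitions[ref].add(doc)
    return dict(partitions)


def coupling_counts(partitions):
    """Count, for each document pair, the partitions in which both occur."""
    counts = defaultdict(int)
    for docs in partitions.values():
        for a, b in combinations(sorted(docs), 2):
            counts[(a, b)] += 1
    return dict(counts)


# Hypothetical documents A, B, C and cited articles r1, r2, r3.
citations = {
    "A": {"r1", "r2"},
    "B": {"r2", "r3"},
    "C": {"r1", "r2"},
}
partitions = cite_same_partitions(citations)
counts = coupling_counts(partitions)
print(counts)
```

Documents A and C share two citations and are therefore the most 
strongly coupled pair in this sample. 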

(4) Cited-by same Partitions. 

We note from Fig. 6.3 that the documents covered by the T.I.P. file 
have all been written in the last three years. Due to the time required 
to review and publish articles there is usually a period of at least six 
months between the time an article is published and the time citations 
to it begin to appear in the literature. And even after a span of two 
to three years over half of the articles in the Physical Review have 
still not been cited by subsequent articles in the Physical Review [21]. 
Thus this type of partitioning will have a very small yield for the 
current T.I.P. file in terms of the number of documents that will occur 
in one or more subsets of interest and in terms of the total number of 
co-occurrences of articles that will be generated. 
(5) Subject Category Partitions. 

A subject index is published of the articles in the Physical Review. 
Each article is assigned to from one to four categories. These category 
groupings form another type of file partitioning. However, not all of 
the 25 journals have subject indexes and there is no general agreement 
on category headings among the indexes that do exist. Also the categories 
even within a single journal are constantly changing. 

In the beginning we decided to use all five of the above types of 
partitionings for the experimental system with the hope that each would 
add meaningful links to the resulting document network. However, the 
results of the above tests led us to conclude that the use of criterion 
(3) only would result in an adequate set of partitionings, and would 
avoid some of the problems encountered in using the other criteria. The 
final experimental system is, therefore, based on partitionings of type 
(3) only. 
CHAPTER VII 


FILE STRUCTURE 


Thus far we have described the computational facility on which the 
experimental system operates and the data it uses. Let us now turn our 
attention to the problem of how the data should be arranged and structured 
for storage on the disc or in core. The first section of this chapter 
describes the general approach adopted in this project for the storage of 
data. Then four basic types of files are suggested and various 
combinations of the basic types are proposed for the overall data storage 
system of the project. Certain arguments favoring the overall storage 
system that was selected are set forth. In the last section a brief 
discussion is presented of the type of data structure that would be 
appropriate for the data that has been loaded into the high speed core 
storage for processing. 


7.1 Description and Arrangement of Data 


A few rather general comments on the problem of data storage are in 
order before we launch into a description of the particular types of 
files considered for this project. 

It will be useful in our discussion to think of the data to be stored 
as forming a tree-like structure. For example, the information file 
generated by the Technical Information Project (Sec. 6.21) can be 
subdivided into journals. Each of the journals can be broken down into a 
number of volumes. Each volume in turn consists of some articles. 
Within an article there are several information types--title, author(s), 
etc. Some of these information types may be further subdivided. For 
example, one can split the author information into the separate authors 
of the article. Fig. 7.1 portrays this tree structure. 


[Figure: tree with levels, from top to bottom: data file; journal 
nodes; volume nodes; article nodes; information types; separate 
authors.] 

Fig. 7.1. Example of tree-like structure of data. 


Each terminal node at the bottom of this tree represents a piece of 
data which must be stored, such as an author's name or a citation. Each 
parent node represents the grouping together of one or more pieces of 
logically related data. For example, a volume node groups together all 
the articles which are contained in that volume. 

Let us first consider a couple of problems involved in storing the 
data represented by the terminal nodes. Much of this data is variable 
in length. For example, titles might vary from 20-200 characters. Two 
ways of handling variable size data suggest themselves. One might use a 
special code or flag to indicate the end of the piece of data or one 
might explicitly store the length somewhere in the file. The latter 
approach was selected since one would always have to perform a search to 
determine the end of the data if a flag were used. 

In addition to knowing how long a piece of data is we must know its
type or identification. For example, it is not possible, in general, to
determine whether a string of characters is a title or an author without
being explicitly told this fact. If there were one and only one title,
author, citation, etc. for each article, then the information type could
be specified by the relative position or order of the pieces of data.
However, for a given article there may be none or several citations and
one cannot specify the information type implicitly by the order.

Thus, in addition to storing the actual data for each terminal node, 
one must give two additional facts--length and type. The storage of 
these two additional facts is useful for the parent nodes in the above 
tree as well as for the terminal nodes. The type of information for a 
given node serves to identify that node from all of its sister nodes 
which are under the same parent node. The length information delimits 
the scope of the node. For example, a volume node would have for its 
identification the volume number, and for its length either the number of 
articles in the volume or the amount of storage occupied by those 
articles. Thus one can summarize the storage requirements of a data file
by the following two statements. An identification and length must be 
stored for every node in the related tree structure. In addition one 
must store a piece of literal data for each terminal node. 
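As an illustration only, the following Python sketch stores each terminal node as an identification, a length, and the literal data, so that the end of a field is found from the stored length rather than by scanning for a flag. The byte layout (a one-byte type code and a two-byte length) and the numeric type codes are assumptions made for this sketch, not the layout of the T.I.P. file.

```python
import struct

# Hypothetical layout: 1-byte type code, 2-byte big-endian length, then the data.
TITLE, AUTHOR, CITATION = 1, 2, 3  # illustrative type codes, not from the source


def pack_field(type_code: int, text: str) -> bytes:
    """Store identification and length ahead of the literal data."""
    data = text.encode("ascii")
    return struct.pack(">BH", type_code, len(data)) + data


def read_fields(buf: bytes):
    """Yield (type_code, text) pairs, using the stored lengths to find field ends."""
    pos = 0
    while pos < len(buf):
        type_code, length = struct.unpack_from(">BH", buf, pos)
        pos += 3  # size of the type code plus the length
        yield type_code, buf[pos:pos + length].decode("ascii")
        pos += length


# An article with one title, one author, and two citations.
record = (pack_field(TITLE, "ON LANGMUIR PROBES")
          + pack_field(AUTHOR, "OKUDA T.")
          + pack_field(CITATION, "PHYS REV 128 1")
          + pack_field(CITATION, "PHYS REV 130 7"))
fields = list(read_fields(record))
```

Because the type code travels with every field, the two citations are recognized as citations even though their number is not fixed in advance.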

The last question to be discussed here relates to the actual 
physical order in which data is to be stored. Let us use the example of 
Fig. 7.2 to describe the arrangement selected. One can flatten the tree 
of Fig. 7.2 out into the linear array of nodes shown in Fig. 7.3 such 
that no two connecting lines cross, and such that each parent node is to
the left of its subnodes.
[Figure: tree with the article node D at its root and the title, author, and citation nodes beneath it.]

Fig. 7.2. Example used to show physical order given the data.

[Figure: the nodes of Fig. 7.2 laid out in a single row, each parent node to the left of its subnodes, grouped as title, authors, and citations.]

Fig. 7.3. Linear arrangement of data in Fig. 7.2.


This is the physical order in which the data is stored for this 
project. For the example of Fig. 7.3 the article identification and
length are first (node D). This is followed by the code for title 
information, the title length, and the actual title (node T). Next is 
the code for author information and the length of the author data 
(node A). Then the information on a particular author is given (node A). 
This includes the author's identification (his position among the 
authors of the article), the length of his name, and his actual name. 
The description for the remaining nodes is similar. 
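The physical order just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the `Node` class, the node labels, and the choice to measure a parent's length in list entries are all hypothetical and are not taken from the project's programs.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    ident: str                     # identification (type code, position, ...)
    data: Optional[str] = None     # literal data, terminal nodes only
    children: List["Node"] = field(default_factory=list)


def flatten(node: Node) -> list:
    """Return [ident, length, ...] with each parent ahead of its subnodes.

    For a terminal node the length counts the characters of its data; for a
    parent node it counts the list entries occupied by all of its subnodes.
    """
    if node.data is not None:
        return [node.ident, len(node.data), node.data]
    body = []
    for child in node.children:
        body.extend(flatten(child))
    return [node.ident, len(body)] + body


article = Node("D", children=[
    Node("T", data="PROBE THEORY"),
    Node("A", children=[Node("A1", data="OKUDA T."),
                        Node("A2", data="YAMAMOTO K.")]),
])
flat = flatten(article)
```

Since every parent carries the length of its subnodes, a reader of `flat` can step over the whole author group by advancing two entries plus the stored length, without examining the individual authors.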

It may be of interest to note that the above approach is analogous
to Polish prefix notation. Consider the algebraic expression [A - (B+C)].
Its Polish prefix form, -[A,+(B,C)], is obtained by flattening the tree
of Fig. 7.4 such that no lines cross. If one equates terminal nodes to
operands and parent nodes to operators, then our storage arrangement is
the Polish prefix form of the data.

Fig. 7.4. Polish prefix notation.
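The analogy can be made concrete with a short Python sketch. The tuple representation of the tree and the function names are assumptions introduced for illustration; the flattening rule itself (operator first, then its operands from left to right) is the one described above.

```python
def to_prefix(tree):
    """Flatten a nested (operator, left, right) tuple into prefix order."""
    if isinstance(tree, str):        # terminal node: an operand
        return [tree]
    op, left, right = tree           # parent node: an operator
    return [op] + to_prefix(left) + to_prefix(right)


def evaluate(tokens, values):
    """Evaluate a prefix token list; returns (value, remaining tokens)."""
    head, rest = tokens[0], tokens[1:]
    if head in values:
        return values[head], rest
    left, rest = evaluate(rest, values)
    right, rest = evaluate(rest, values)
    return (left - right if head == "-" else left + right), rest


expr = ("-", "A", ("+", "B", "C"))   # the tree for A - (B + C)
prefix = to_prefix(expr)
```

The second function shows that the prefix list can be read in a single left-to-right pass, which is the property that makes the same arrangement convenient for stored data.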


7.2 Types of Files 
In this section four basic types of data files are described. An 
overall data storage system might consist of only one of the file types
or it might include a combination of several types.


7.21 Raw Data File 

The file of data generated by the Technical Information Project 
(Sec. 6.21) will be termed the raw data file. It currently has the
'Polish prefix' structure described above. The precise substructure of
a given article is shown in Fig. 7.5. The relative amount of storage
occupied by each of the types of information is given in the table of
Fig. 7.6.


[Figure: tree diagram. Levels from root to leaves: raw data file, journal nodes, volume nodes, article nodes; each article node branches into title, author(s), location(s), and citation(s).]

Fig. 7.5. Structure of raw data file.


article node (ident. and length)      5%
title                                21%
authors                              14%
author locations                     28%
citations                            32%
                                    ----
                                    100%

Fig. 7.6. Percent of storage occupied by each information type.


7.22 Inverted Files 
An inverted file is a type of index to the raw data file. For
example, one might create an inverted author file by extracting from
each article the authors' names. These names could be alphabetized and
the duplicates deleted. Such a file would have the structure shown in
Fig. 7.7. In this figure nodes $D_1, D_2, \ldots, D_n$ are the identifications of the
articles written by a given author.


[Figure: tree diagram. Levels from root to leaves: inverted author file, author nodes, articles.]

Fig. 7.7. Structure of inverted author file.
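The construction of an inverted author file can be sketched in Python as follows. The article identifications and the in-memory dictionary are placeholders chosen for illustration; the actual file resides on disc and uses the node structure described in Sec. 7.1.

```python
from collections import defaultdict

# Hypothetical raw data: article identification -> authors of that article.
raw_data = {
    "D1": ["CARLSON R. W.", "OKUDA T.", "OSKAM H. J."],
    "D2": ["OKUDA T.", "YAMAMOTO K."],
    "D3": ["BOSCHI A."],
}


def invert_authors(raw):
    """Build author -> article identifications, with authors alphabetized."""
    index = defaultdict(list)
    for article_id, authors in raw.items():
        for name in authors:
            index[name].append(article_id)
    # Each author appears once as an entry; his articles hang beneath it.
    return {name: sorted(index[name]) for name in sorted(index)}


inverted = invert_authors(raw_data)
```

An author search then becomes a look-up of one entry instead of a pass through every article.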


Inverted files have been created for title words, authors,
locations, and citations. Because of a current lack of storage space,
the inverted files cover only a part of the total raw data file. This
partial coverage was found to be sufficient for experimental purposes,
however.

On the basis of the experience gained with these partially completed
inverted files, it is estimated that inverted files for the full raw data
file will increase storage requirements by the percentages given in
Fig. 7.8.

title word file . . . . . 17.7% of raw data file
author file . . . . . . . [illegible]
location file . . . . . . [illegible]
citation file . . . . . . 7.5%
Total . . . . . . . . . . 95.65%

Fig. 7.8. Storage requirements for inverted files.


There are certain additional steps that can be taken which will
probably reduce the additional storage required to only about 70% of
the raw data file. Thus adding inverted files increases storage
requirements by a factor of 1.5 to 2.0. It is suspected that the amount of
storage needed for file inversion is a relatively standard factor for
most types of information. Certainly the types of information found in
the test file of this project (title words, authors, locations,
citations) varied markedly in their characteristics but still followed
roughly this factor of two increase.

Fig. 7.9 shows that the relative amount of storage required for an
inverted author file decreases as the size of the file increases. The
leveling off shown leads one to believe that an order of magnitude
increase in the test file would not significantly change the percent
increase in storage required for an inverted author file. A similar
leveling off was found for title words.
[Figure: plot of inverted author file size (based on percent of raw data file size) against the number of years of Physical Review in the stack.]

Fig. 7.9. Storage required for inverted author file.
(For articles in Physical Review 1959-6)

There is a good theoretical reason why the inverted files should 
require about the same amount of storage as the raw data itself. The 
reason is that the inverted files store the same information as the raw 
data file (except perhaps for the relative order of some of the data). 
Indeed one could reconstruct the raw data file from the inverted files 
by merely collecting together the title words, authors, etc. for each 
article. The one exception to the equivalence of the information found 
in the two types of files concerns order. One cannot determine from the 
inverted word file the order that the words originally had in the titles 
of the raw data file, but only which words belong to each title. Of 
course, some additional provision might be made so that inverted files 
contained order information as well as the article identifications. 
However, the point here is that the two types of files should require
about the same amount of storage.
7.23 Linkage Files

A linkage file contains a description of a document network of the 
type described in Chapter III. The basic information needed to describe 
such a network consists of document node identifications and link values. 

The structure of a linkage file is shown in Fig. 7.10. For each
document node in the network there is an entry in the file which consists
of the identification of the document along with the information on the
links emanating from the node. The linkage information consists of the
identifications of the other document nodes connected to the node in
question along with the values of the connecting links. In such a file
it is necessary to store only those links for which $N_{ij} \neq 0$, with the
understanding that the value of all other links is K.
[Figure: tree diagram. Levels from root to leaves: linkage file, document nodes, linkage node pairs; each pair holds the identification of a linked document and the value of the link.]

Fig. 7.10. Structure of Linkage File.


Note that the information on each link is specified in two places
in a linkage file. For example, the value of $C(x_i, x_j)$ is stored in the
entry for document $x_i$ and also in the entry for $x_j$. This redundancy
ensures that once the entry on a given document is located, one
immediately knows all of the documents to which it is linked as well
as the values of the links.
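A linkage file of this kind can be sketched in Python as a table of entries, one per document. The class name, the document labels, and the numeric value used for K are assumptions made for the sketch; the actual value of K comes from the measure of relatedness defined earlier in the thesis.

```python
from collections import defaultdict

K = 0.0  # stand-in for the default link value K (an assumption for this sketch)


class LinkageFile:
    def __init__(self):
        # document identification -> {linked document: value of the link}
        self.entries = defaultdict(dict)

    def add_link(self, a, b, value):
        # The link is stored under both documents, as in the redundancy above.
        self.entries[a][b] = value
        self.entries[b][a] = value

    def link(self, a, b):
        # Links that were never stored are understood to have the value K.
        return self.entries.get(a, {}).get(b, K)

    def neighbours(self, a):
        """All documents linked to a, with the link values, from one entry."""
        return dict(self.entries.get(a, {}))


links = LinkageFile()
links.add_link("x1", "x2", 0.7)
links.add_link("x1", "x3", 0.4)
```

Locating the single entry for `x1` yields both of its links at once, which is the benefit bought by storing each link twice.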
In an attempt to gain some insight into the size and characteristics
of linkage files, a test was conducted on one volume (Vol. 128) of the
Physical Review. Linkage files were created based on each of the five
types of partitions discussed in Sec. 6.22. The results of this test
are summarized in Fig. 7.11.


Partitioning criterion on          File size (based on size   Percent of total possible
which links are based              of Phys. Rev. Vol. 128)    links for which $N_{ij} \neq 0$

(1) Authors (estimated)            15% of raw data file       1/2%
(2) Title words (for words         58% of raw data file       4%
    occurring less than 20 times)
(3) Cite-same                      [illegible]                1 1/2%
(4) Cited-by-same (citations to    [illegible]                small
    v.128 from v.128-133)
(5) Subject category               15% of raw data file       15%

Fig. 7.11. Table of linkage file sizes for vol. 128 of
the Physical Review.

Fig. 7.11 indicates that partitioning criterion (3) generates a
network in which about 1 1/2% of the links have values other than K
(i.e. $N_{ij} \neq 0$). This is for a single volume of the Physical Review. It
would seem reasonable that this percentage would be somewhat less for
the total document file. We shall assume in the analysis of the next
section that approximately 1% of the possible links in the network of
the total file have non-K values. This means that each document in the
T.I.P. file is linked to about (.01)(26,000)=260 other documents on the
average.
7.24 Request-Answer File

The actual generation of this type of file was never seriously 
contemplated because of the immense amount of processing time and storage 
space that would be required. It is described here because it represents 
an extreme case to which we wish to make reference in the next section. 

A request-answer file contains the answer cluster for each possible
request. Its possible structure could be represented by Fig. 7.12.
$D_1, D_2, \ldots, D_n$ in this figure are the documents contained in the particular
answer cluster in question.


[Figure: tree diagram. Levels from root to leaves: request-answer file, possible request nodes, answer cluster nodes, document nodes.]

Fig. 7.12. Structure of request-answer file.


Retrieval from this type of file would consist of a simple table
look-up for the request and then presentation of the associated answer
cluster.


7.3 Storage Systems 

The overall storage system selected for this project could consist 
of any combination of one or more of the types of files described in the 
preceding section. For purposes of discussion and comparison let us 
suggest four types of storage systems. The first three were implemented
and tested to some extent. System (2) is the one that was finally
selected for this project.

(1) Raw data file only.

(2) Raw data file and inverted files. 

(3) Raw data file and linkage file. 

(4) Raw data file and request-answer file. 

The raw data file is included in each of the four storage systems 
so that information on specific articles can be presented to the user at 
any time he wants it. For instance, a user might want to know the title 
and author(s) of an article that is about to be added to the set S.
This information would be obtained from the raw data file.

Each of the four suggested data storage systems could serve as a
base for the clustering procedure of Chapter V. There are some significant
differences in the characteristics of the retrieval system that
would result, however. Let us indicate some of the differences by
discussing four important characteristics of the resulting retrieval systems.


7.31 Storage Space Required 

Since the raw data file is basic to all four systems, we will 
express storage requirements in terms of the size of that file. It has 
already been noted that the inverted files require about as much storage 
as the raw data file. If we make the assumption that 1% of all possible
links have non-K values as was suggested in Sec. 7.23, then the linkage
file for the TIP document collection would be about six times as large
as the raw data file. If we assume that every request for information
consists of only two documents of interest and every answer cluster
contains 20 documents, then a request-answer file would be about 35
times the size of the raw data file. Much more space would be required
if larger requests were allowed. These figures are summarized in
Fig. 7.13.
(1) Raw data only . . . . . . . . . .  100% of raw data file
(2) Raw data plus inverted . . . . . .  200% of raw data file
(3) Raw data plus linkage . . . . . .   700% of raw data file
(4) Raw data plus request-answer . . . 3500% of raw data file

Fig. 7.13. Comparison of storage requirements for the four
types of data systems.

7.32 Processing Time 

Let us next determine the average amount of processing time that 
would be needed to transform a request into an answer cluster for each of 
the proposed storage systems. By processing time we mean the amount of 
time allocated by the central processor of the Project MAC system to 
running the clustering program. The time spent in swapping the program 
in and out of core storage is excluded. The ratio of the real time that
the MAC user must wait to the processing time varies with the number and 
type of users on the system and can range from one to forty or fifty. 

The time required to access a piece of data on the 1302 disc is 
about 1/2 second. This includes both the time spent by the disc control 
supervisor and by the disc in locating and reading a track. Thus the 
request-answer system would require about a second in order to find an
answer, since very little computational or manipulative work is required. 

For a linkage file system at least 20 accesses to the disc would be 
required (for a cluster of 20 documents). This would involve about 10 
seconds of processing time in addition to some computational time which 
was found to be small in comparison. We pick 15 seconds as the average
amount of time required to find a 20-document cluster if linkage files
are available.

The amount of processing time required to find a 20-document
cluster with an inverted file storage structure has been found to be 50-60
seconds. This includes 60 or so accesses to the disc and a fair amount
of manipulation and computation.

If only the raw data file is available, then one must pass through 
the total data file two or three times looking for documents that are 
linked to the documents in sets Y, Z, and S. One complete pass through 
the raw data file takes 200-300 seconds. Thus the average processing 
time would be on the order of 600 seconds. Fig. 7.14 summarizes the
processing time required for each of the four systems.


(1) Raw data only . . . . . . . . . . 600 sec.
(2) Raw data plus inverted . . . . . .  60 sec.
(3) Raw data plus linkage . . . . . .   15 sec.
(4) Raw data plus request-answer . . .   1 sec.

Fig. 7.14. Average processing time required to find a
cluster of 20 documents for the four types
of storage systems.
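The disc-access component of these estimates follows from the figures quoted in this section, as the short Python calculation below illustrates. Only the access time, the access counts, and the pass times are taken from the text; the variable names are introduced here for illustration.

```python
ACCESS_TIME = 0.5  # seconds per access to the disc, as stated in the text


def access_seconds(n_accesses: int) -> float:
    """Time spent in disc accesses alone, excluding computation."""
    return n_accesses * ACCESS_TIME


linkage_access = access_seconds(20)     # one access per document in the cluster
inverted_access = access_seconds(60)    # "60 or so" accesses
raw_low, raw_high = 2 * 200, 3 * 300    # two or three passes at 200-300 sec. each
```

On these figures about half of the 50-60 seconds quoted for the inverted file system is disc access, the remainder being manipulation and computation, and the 600 seconds quoted for the raw data system lies within the 400-900 second range of two or three complete passes.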


7.33 Updating and Editing 


Besides the processing time involved in answering requests there is 
a certain amount of time required for updating and editing the file, 
since it is constantly changing. For purposes of comparison let us 
consider the problem of adding 335 articles (50 tracks of raw data) to
an existing file of 20,000 articles (3000 tracks). The time required to 
load and structure the raw data file will not be considered since it is 
common to all four storage systems. 

In order to update the inverted files one must extract the
appropriate fields from the new raw data, sort them into the desired
sequences and merge the sorted data with the old inverted files. The
current programs for doing this would take about 400 seconds for the 50
tracks of data. The time needed for each information type is as follows:
words - 90 sec., authors - 50 sec., citations - 210 sec., locations -
50 sec. The time for each process is as follows: extraction - 25 sec.,
sorting - 150 sec., merging - 230 sec.
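The three steps of extraction, sorting, and merging can be sketched in Python for the author file. The representation of the inverted file as a sorted list of (author, article) pairs and the article identifications are assumptions made for the sketch, not the format of the project's files.

```python
import heapq


def update_inverted(old_entries, new_articles):
    """Merge the authors of new articles into a sorted inverted author file.

    old_entries: list of (author, article_id) pairs already in sorted order.
    new_articles: dict mapping article_id -> list of author names.
    """
    # Extraction: pull the author field out of each new article.
    extracted = [(name, art) for art, names in new_articles.items()
                 for name in names]
    # Sorting: put the new entries into the sequence of the old file.
    extracted.sort()
    # Merging: a single sequential pass over both sorted inputs.
    return list(heapq.merge(old_entries, extracted))


old_file = [("BOSCHI A.", "D3"), ("OKUDA T.", "D1")]
merged = update_inverted(old_file, {"D4": ["OKUDA T.", "CARLSON R. W."]})
```

Only the new data is sorted; the old file is read once in sequence, which is consistent with merging being the largest of the three times quoted when the old file is much larger than the new data.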

Consider the problem of updating a linkage file with the links based 
on whether or not two papers cite the same paper (partition type (3) in 
Sec. 6.22). Updating can be accomplished by the following steps. First, 
extract the citations from the 50 tracks of new articles. Sort these 
citations and compare them with the total raw data file to determine 
which articles are linked to each new article. During this comparison
process generate a file of information on the new links. Sort this file
and merge it into the old linkage file. The programs which were written
to perform this updating process were only tested on small files of
several hundred articles. Let us extrapolate the results and estimate
how long it would take to update the linkage file for the case under
consideration. Extracting and sorting the citations of the 335 new
articles would take about 100 seconds. Matching the citations with the
total raw data file would take about 1800 seconds and merging them into
the old linkage file would require about 1200 seconds for a total of
3100 seconds.
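The matching step, in which each new article is compared with the whole file to find the articles that cite the same papers, can be sketched in Python as follows. The dictionaries, the integer citation codes, and the use of a count of shared citations as the recorded link information are assumptions for illustration; the project's programs worked on sorted disc files rather than in-memory tables.

```python
from collections import defaultdict


def new_links(existing, new_articles):
    """Find the links created by new articles under the cite-same criterion.

    Both arguments map article id -> set of citation codes.
    Returns {(new_id, other_id): number of citations the two share}.
    """
    cited_by = defaultdict(set)              # citation code -> citing articles
    for art, cites in list(existing.items()) + list(new_articles.items()):
        for code in cites:
            cited_by[code].add(art)
    links = defaultdict(int)
    for art, cites in new_articles.items():
        for code in cites:
            for other in cited_by[code]:
                if other != art:
                    links[(art, other)] += 1
    return dict(links)


existing_file = {"D1": {1, 2, 3}, "D2": {4}}
added = new_links(existing_file, {"D3": {2, 3, 4}})
```

The resulting pairs correspond to the file of information on the new links, which would then be sorted and merged into the old linkage file.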

The amount of time required to update a request-answer file would 
be more of a guess than an estimate. It would take at least 7000 
seconds to rewrite the file and probably 10 to 100 times more to find 
all the clusters. These figures are tabulated in Fig. 7.15 for ease in
comparison.

(1) Raw data only . . . . . . . . . .     0 sec.
(2) Raw data plus inverted . . . . . .  400 sec.
(3) Raw data plus linkage . . . . . .  3100 sec.
(4) Raw data plus request-answer . . . 7000+ sec.

Fig. 7.15. Processing time required to update a file of 20,000
articles with 335 new articles for each of the
four storage systems.


7.34 Flexibility and Compatibility


So far we have been mainly concerned with how much storage space 
and processing time is required for a system which finds answer 
clusters. Actually the process of finding clusters as proposed in this 
thesis is not considered to be the only retrieval tool which will be 
made available to the user. Rather clustering is looked upon as one 
possible component in a larger, more general retrieval system. It 
follows that the storage structure of the data should not be designed 
with just the clustering process in mind, but it should be chosen on the 
basis of its utility and adaptability to a large class of retrieval 
functions. 

Even if the data file for the experimental system were to be used 
exclusively for clustering, it would still be useful to make the 
structure selected as general as possible. One reason why this is so
stems from the fact that any experimental system is generally in a 
constant state of flux and any rigid or specialized data structure may 
soon be rendered obsolete. 

Let us suggest that the following objective might yield a data
storage structure which would provide an adequate base for a large
number of different retrieval functions and at the same time strike a
suitable compromise between storage and time requirements.

"The amount of storage required should be minimized
subject to the restriction that at no time should one have to
serially search through the total file to obtain a given
piece of information. By serial search we mean a sequential
examination of every article in the file."


7.4 Selection of Storage System 


From Sec. 7.31 and 7.32 it is evident that no data structure will
at the same time minimize the processing time and storage space
required. Some type of engineering compromise is needed. This compromise
must be influenced by such factors as the characteristics of the
computational facilities to be used and by the type of retrieval service that
is to be offered. One must also consider the costs involved in updating 
the file and how often updating is to be performed. The decision is 
further complicated by the fact that the structure selected should be 
compatible with other retrieval functions and flexible to change. 

A storage system consisting of the raw data only requires the least
amount of storage space and the least effort to update. Its major
drawback is in the time required to answer a request. Even now with the
current file of about 26,000 articles the time required to find
information is generally too great to allow for close man-machine coupling.
And if the file size were to increase by an order of magnitude, a system
based on this structure would certainly be too slow.

The linkage and request-answer files have excellent response times
but require an excessively large amount of storage space and are very
hard to update. In addition they are designed specifically for the
purpose of finding clusters and have little or no real value to other
retrieval operations.

The second type of data storage system, consisting of the raw data
file and the inverted files, was the one selected for this project. Its
storage requirements were less than double those of the raw
data file alone. The processing time required to find a cluster was
high, but not so high as to exclude close man-machine interaction, and
it appears that an order of magnitude increase in the file size would
not appreciably increase these time requirements. Updating of the
system could be done on a daily or weekly basis without consuming an
excessive amount of computational effort. The structure is also useful
in a large number of other retrieval operations as will become more
obvious in the next chapter.


7.5 High Speed Storage Structure 
So far in this chapter we have discussed how the data should be
structured for permanent storage on the disc. A related problem
concerns the form the data should take once it has been selected for
processing and is loaded into high speed core storage.

The approach that was used in the earlier versions of the experimental
system was to convert the data to a "list" structure as it was
loaded into core. This involves associating one or more address
pointers with each piece of data. The pointers preserve the original
sequence of the data without requiring that it occupy contiguous
locations in memory. One of the major advantages of such a structure is the
relative ease with which the data can be re-arranged and with which
particular pieces of data can be added and deleted. Some of the
programming languages that have been developed to facilitate the creation
and manipulation of list structures are COMIT, LISP, SLIP, and SNOBOL.

It was later decided that the added flexibility obtained through 
the use of list structures was not, in general, needed for library-type 
data that remains relatively fixed. Indeed the processing time required 
to reformat the data into lists was considerable. Therefore the approach 
that was finally adopted was to leave the data in core in the same form 
that it was on the disc. 

It is actually easier to perform some of the operations needed in 
the formation of a cluster on this disc structure than it is to do them
on the equivalent list structure. Take, for example, the calculation of
the $N_{ij}$. For the partitioning criterion selected this would involve
the comparison of two tables of citations. The most efficient way that 
has been found to do this is to have the citation codes of each article 
in numeric order on the disc, and to make a single synchronous pass 
through the two tables tallying the number of matching entries. The 
time required to do this match if the data has a list structure would 
probably at least double. There are also certain other operations (e.g. 
binary or logarithmic searches) for which a list structure is not well 
suited. 
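The single synchronous pass through two ordered citation tables can be sketched in Python as follows. The function name and the integer citation codes are illustrative assumptions; the sketch also assumes that a citation code appears at most once within a table.

```python
def count_matches(a, b):
    """Count entries common to two ascending lists in one synchronous pass."""
    i = j = matches = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            matches += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1      # advance whichever table is behind
        else:
            j += 1
    return matches


shared = count_matches([3, 8, 15, 21, 40], [1, 8, 21, 22, 40, 57])
```

Each table is read once from front to back, so the work is proportional to the combined length of the two tables. The pass relies on the entries lying in consecutive locations in numeric order, which is what the disc form of the data already provides.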

For the final version of the experimental system a rather simple
storage allocation system was adopted which kept track of the available
free core storage. Through this system blocks of storage could be
allocated, changed in size, or freed up for other uses. Reference to
each block was through a numeric code so that the actual address of the
block could change. This made it possible to keep all the free storage
in one contiguous block. Data from the disc was loaded into these
blocks of storage and processed there.
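The idea of referring to blocks through numeric codes, so that blocks may be moved and the free storage kept in one piece, can be sketched in Python as follows. The class and method names are hypothetical, the sketch omits changing the size of a block, and it is not a reconstruction of the allocator actually written for the system.

```python
class HandleAllocator:
    """Toy allocator: blocks are reached through numeric codes (handles),
    so a block can be moved without the caller noticing."""

    def __init__(self, size):
        self.core = [None] * size
        self.blocks = {}         # handle -> (start, length)
        self.free_start = 0      # everything from here to the end is free
        self.next_handle = 1

    def allocate(self, length):
        if self.free_start + length > len(self.core):
            raise MemoryError("not enough free core storage")
        handle = self.next_handle
        self.next_handle += 1
        self.blocks[handle] = (self.free_start, length)
        self.free_start += length
        return handle

    def free(self, handle):
        del self.blocks[handle]
        self._compact()

    def _compact(self):
        # Slide the surviving blocks down in address order. Only the table
        # of (start, length) changes; the handles held by callers do not.
        pos = 0
        in_order = sorted(self.blocks.items(), key=lambda kv: kv[1][0])
        for handle, (start, length) in in_order:
            self.core[pos:pos + length] = self.core[start:start + length]
            self.blocks[handle] = (pos, length)
            pos += length
        self.free_start = pos

    def write(self, handle, values):
        start, length = self.blocks[handle]
        values = list(values)
        if len(values) != length:
            raise ValueError("data must fill the block exactly")
        self.core[start:start + length] = values

    def read(self, handle):
        start, length = self.blocks[handle]
        return self.core[start:start + length]


alloc = HandleAllocator(10)
first = alloc.allocate(3)
second = alloc.allocate(4)
alloc.write(second, [1, 2, 3, 4])
alloc.free(first)    # the second block slides down; its handle is unchanged
```

After the first block is freed, the second block occupies the lowest addresses and the six remaining locations form one contiguous free area.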

The S, Y, and Z document sets were also placed in blocks obtained
from the storage allocator. It was later decided that this was a
distinct disadvantage to the system because the sets were constantly
changing and should have had the flexibility available from a list
structure.
CHAPTER VIII


INTERACTION LANGUAGE 


The description of the experimental system is now almost complete. 
The clustering procedure which is used in answering requests has been 
defined in Chapter V. The computational facilities and data base on 
which the system operates have been described in Chapter VI. In Chapter 
VII the way the data is structured was explained.

The one aspect of the experimental system that has not been covered 
concerns the interface between the user and the system. In this chapter 
we will describe the language which permits the user to communicate and
interact with the system.


8.1 Background to Language 


As a way of introducing the language we will present in this
section some of the general design objectives that were selected for the
language and an example of a typical interaction using the language.


8.11 Design Objectives of Language 
The first retrieval language developed for this project was
designed specifically for clustering and bore little resemblance to the
language used by the Technical Information Project programs in performing
the more conventional matching functions (author, citation, and keyword
searches, bibliographic coupling, etc.). It was found to be inconvenient
and confusing to have to shift from one program and one language to
another program and another language every time one wanted to shift from
a clustering request to a T.I.P. request and vice versa. It was decided
that the same general language should be used for both functions. This
goal is related to the idea expressed in the last chapter that the
clustering function should be considered a component of a larger
retrieval system (Sec. 7.34). Not only should the data structure be
designed for the larger, more general system, but the retrieval language
should also. In the remainder of the chapter the clustering and matching
functions will, therefore, be treated equally.

In addition to having adequate expressiveness for the current 
clustering and T.I.P. commands, it was considered desirable that the 
language be flexible enough so that it might be easily extended to other
types of retrieval operations. 

A second objective of the language is that it should be easy to
learn, use, and remember. It was decided that if the vocabulary and
syntax of the language resembled normal English it would be easiest to
learn and remember. However, it was found to be rather tedious after a
while to have to type a complete English sentence for a request. An
abbreviated version of the language was, therefore, developed for the
experienced user which allowed much of the vocabulary to be abbreviated.
The abbreviated version was such that one could make a smooth transition
from the full English request to the abbreviated request as he became
more familiar with the system. An example of a complete request and the
equivalent abbreviated request follow.

"Print the authors and locations of all the articles cited by the 
article, Physical Review, volume 135, page 3." 


"p art loc of art cited by 1 135 1." 




A third goal of the language is that it be simple enough to process
efficiently and quickly. Even a rather complex request in the language
that was adopted takes much less than a second of central processor
time to interpret.


8.12 Example of Language 


In Fig. 8.1 is an example of an interaction that might occur
between a user and the system. The lines that the user types are
underlined. First he initiates the MARS (Machine Aided Retrieval System)
program. We assume that the one fact the user knows is that he is
interested in something about Langmuir probes. He could just as well
have known an author or paper that interested him or perhaps a
combination of these.

In the first command he asks for a list of those articles containing 
the word, "Langmuir", in their titles. Let us say that after examination 
of the list produced, the user decides that the papers by three of the 
authors are the most interesting. He now asks for all papers written by 
these three authors (that have not already been retrieved). 

Next we assume that the user selects two of the papers as of 
particular interest and wishes to form a cluster around them. Further 
he decides that one of the papers is definitely not what he wants and 
he, therefore, specifies that it is not of interest. A close interaction 
sequence follows with the system presenting papers that are about to be 
added to or deleted from the set S and the user deciding which are of 
interest and which are not. 

Finally a cluster is formed and the user stores it on the disc for
future reference. He then analyzes its characteristics by making various
lists of frequency counts.
RESUME MARS
W 13b0.4 


PRINT TITLES AND AUTHORS OF ARTICLES CONTAINING THE WORD, ‘LANGMUIR’. 
17 ARTICLES IN SET 1. 


PHYSICA 

VOLUME: 30 

PAGE: 162 

STUDIES OF THE DYNAMIC PROPERTIES OF LANGMUIR PROBES I: MEASURING METHODS 
CARLSON R. W. 
OKUDA T.
OSKAM H. J. 


NUOVO CIMENTO 

VOLUME: 29 

PAGE: 1,87 

EFFECT OF A R.F. SIGNAL ON THE CHARACTERISTIC OF A LANGMUIR PROBE.
BOSCHI A. 


PRINT THE TITLES AND AUTHORS OF ARTICLES BY R. W. CARLSON OR T. OKUDA OR 
H. J. OSKAM BUT NOT IN SET 1. 


6 ARTICLES IN SET 2. 


JOURNAL OF THE PHYSICAL SOCIETY OF JAPAN 
VOLUME: 13 
PAGE: 1212 
DISTURBANCE PHENOMENA IN PROBE MEASUREMENT OF IONIZED GASES. 
OKUDA T.
YAMAMOTO K. 


...


PRINT FOR DECISION THE TITLES AND AUTHORS OF ARTICLES RELATED TO PHYSICA, 
V. 30, P. 102, AND J. PHYSICAL SOCIETY OF JAPAN, V. 13, P. 1212, BUT NOT 


NUOVO CIMENTO, V. 29, P. L6T. 


TO BE ADDED: 

PHYSICS LETTERS 

VOLUME: 11 

PAGE: 126 

THE PLASMA RESONANCE PROBE IN A MAGNETIC FIELD. 
CRAWFORD F. W. 
HARP R. S. 


IS THIS OF INTEREST: YES 


TO BE ADDED: 
...


END.


SAVE SET 3. 

FILE SET 3 CREATED. 

END. 

PRINT THE FREQUENCY OF AUTHORS IN SET 3. 
23 AUTHORS IN SET 3. 


14 OKUDA T.
3 CARLSON R. W.
...
END. 


Fig. 8.1. Example of possible user interaction with data 
using retrieval language. 
(Lines typed by user are underlined. ) 


8.2 Description of Language 


Two methods of describing the retrieval language have been
selected. In the first the syntax of the language is described by
means of a finite state (sequential) machine. In the second the syntax
and vocabulary are defined by means of Backus normal (ALGOL 60) notation.
The equivalence of these two descriptions is also shown.


8.21 Finite State Machine Description 

There are a number of different methods that could be used to
describe the retrieval language that was developed for this project.
Perhaps the most appropriate way to describe the syntax of the language
is to present the same table that is actually used by the interpretive
part of the retrieval system. Fig. 8.2 is the syntax table, which has
been extracted from a program listing. It is a tabular description of a
finite state machine. The first column contains the identifications of
the various states. Column two pertains to one of the languages used to
write the system (it is the name of a MACRO in FAP) and is not pertinent
to our discussion here. The third column contains the valid state
transitions that can occur. For example, the entry (V,2) for S1 means
that the machine will change from state S1 to S2 if the input signal is
V (verb).

S1   STATE  ((V,2)(X,1)(A,1))
S2   STATE  ((V,2)(C,3)(N,4)(L,8)(E,10)(X,2)(A,2))
S3   STATE  ((V,2)(X,3)(A,3))
S4   STATE  ((N,4)(C,5)(P,6)(X,4)(A,4))
S5   STATE  ((N,4)(X,5)(A,5))
S6   STATE  ((N,7)(X,6)(A,6))
S7   STATE  ((P,6)(L,8)(X,7)(A,7))
S8   STATE  ((L,8)(C,9)(E,10)(X,8)(A,8))
S9   STATE  ((P,6)(L,8)(X,9)(A,9))
S10  STATE  ()

Fig. 8.2. Finite state machine description of syntax 
of retrieval language. 

Fig. 8.3 is the state diagram for the machine of Fig. 8.2. We have
left off the self loops on each state due to the X and A inputs to keep
from cluttering up the diagram. Also not shown is the sink state which
the machine enters when the input sequence being analyzed has an invalid
syntax. For example, if the machine is in state S1 and the input signal
is a P, then the sink state is entered. The initial or starting state
of the machine is S1. The final or accepted state is S10. Thus an
input sequence is considered to have an acceptable syntax if it
transforms the machine of Fig. 8.3 from S1 to S10.



Fig. 8.3. Finite State Diagram for the Table of Fig. 8.2. 
(Transitions not shown go to an error or sink 
state. ) 
The input symbols of Fig. 8.2 and 8.3 represent classes of words. 

Fig. 8.4 gives the general titles and some examples of the classes. The 

interpretive procedure first classifies each word in the input statement 

into one of the classes and then checks the syntax by the Table of 

Fig. 8.2. In Fig. 8.5 we present a specific example of an acceptable 


and an unacceptable statement. 


Input Symbol   Class Name                  Specific Examples
V              Verbs                       print, count
N              Nouns                       article, title
P              Prepositions                by, of
A              Adjectives and Adverbs      first, last
C              Conjunctions                and, or
X              Filler Words                the, a
L              Undefined (literal) words   Jones, laser
E              Terminator                  . (carriage return)


Fig. 8.4. Classes of Input Symbols. 
Statement:         Count  the  articles  by  John  Jones  .
Word classes:      V      X    N         P   L     L      E
States traversed:  S1  S2  S2  S4  S6  S8  S8  S10

Statement:         Print  the  titles  of  articles  and  .
Word classes:      V      X    N       P   N         C    E
States traversed:  S1  S2  S2  S4  S6  S7  Sink State


Fig. 8.5. Example of statement with acceptable syntax 
and statement with unacceptable syntax. 
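
The syntax check described above can be sketched in a few lines of
Python. This is only an illustration, not the FAP table interpreter
itself: the names TRANSITIONS, trace, and accepts are hypothetical, the
sink state is represented by None, and the X and A self loops are applied
to every state except S10. Fig. 8.2 as printed lists no L transition out
of S6, yet the first trace of Fig. 8.5 moves from S6 to S8 on an L, so
the entry (L,8) for S6 is added here as an assumption.

```python
# Transition table transcribed from Fig. 8.2 (X and A self loops handled below).
TRANSITIONS = {
    1: {"V": 2},
    2: {"V": 2, "C": 3, "N": 4, "L": 8, "E": 10},
    3: {"V": 2},
    4: {"N": 4, "C": 5, "P": 6},
    5: {"N": 4},
    6: {"N": 7, "L": 8},  # (L,8) is an assumption inferred from Fig. 8.5
    7: {"P": 6, "L": 8},
    8: {"L": 8, "C": 9, "E": 10},
    9: {"P": 6, "L": 8},
    10: {},
}
START, ACCEPT, SINK = 1, 10, None


def trace(classes):
    """Return the states visited for a sequence of word classes.

    The walk stops as soon as the sink state (None) is entered.
    """
    state = START
    path = [state]
    for symbol in classes:
        if symbol in ("X", "A") and state != ACCEPT:
            nxt = state  # filler words and adjectives/adverbs are self loops
        else:
            nxt = TRANSITIONS[state].get(symbol, SINK)
        path.append(nxt)
        if nxt is SINK:
            break
        state = nxt
    return path


def accepts(classes):
    """True if the class sequence drives the machine from S1 to S10."""
    return trace(classes)[-1] == ACCEPT


if __name__ == "__main__":
    print(trace("V X N P L L E".split()))  # Count the articles by John Jones.
    print(trace("V X N P N C E".split()))  # Print the titles of articles and.
```

Under these assumptions the two calls reproduce the state sequences of
Fig. 8.5, with None standing for the sink state.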


Let us comment briefly on the purpose of each state in the diagram
of Fig. 8.3. Preliminary to doing this it should be noted that there
are generally three main parts to an acceptable statement (request):

(1) Verb (states S2 and S3)

(2) Direct object (states S4 and S5)

(3) Modifying phrase (states S6 to S9)

State S1 is the starting state of the machine. State S2 requires that
each request begin with a verb describing what the system should do.
The verb can be either simple (e.g. print) or compound (e.g. count and
save). State S3 excludes the possibility of a double conjunction
between elements of a compound verb (e.g. print and or store). It also
prevents the verb from ending in a conjunction.

State S4 requires that the next part of a request be a list of one
or more nouns signifying the type of information that is to be produced
by the system. This can again be simple (e.g. title) or compound (e.g.
title, authors, and locations). State S5 has a purpose similar to S3.

The last part of the request is the modifying phrase, which
contains the structure of the articles and other entities that are
specified by the user in making the request. States S6 and S7 allow
the request to have a complex structure with several levels of
prepositional phrases modifying other phrases. For example, one could
find the co-authors of a given author by the request: "Find the authors
of articles by John Jones."

States S8 and S9 allow the user to specify some logical combination
of a number of specific fields. For example: "Print the articles by
John Jones and Robert Smith but not Joseph Adams."

The E transition from S2 to S10 is so that certain commands will be
accepted that consist of a verb only. The L transition between S2 and
S8 allows for an abbreviated mode of reference to certain data (e.g.
Print set 3.). Adjectives and adverbs can occur anywhere in a request
and can modify verbs, nouns, etc.


8.22 Backus Normal Description 
Let us leave the finite state description of the syntax of the
language now and provide a more conventional description. The statements
of Figs. 8.6-8.8 constitute the Backus normal (ALGOL 60) description of
the language. In this notation "::=" means "is defined to be", "|"
means "or", and "< >" encloses the defined elements of the language.

Two additional explanations are necessary for the Backus normal
description of Figs. 8.6-8.8. All elements (words) in the statements are
separated by one or more word separators (blanks, commas or periods)
except in the definitions for <word> and <integer> where the characters
have no separation. Adjectives, adverbs, and filler words can occur at
any point in a request, but this fact is omitted from the description to
simplify its statement.
<request> ::= <compound verb> <compound object> <compound modifier>
              <terminator> | <abbreviated command>

<compound verb> ::= <verb> | <compound verb> <verb> |
                    <compound verb> <conjunction> <verb>

<compound object> ::= <noun> | <compound object> <noun> |
                      <compound object> <conjunction> <noun>

<compound modifier> ::= <modifying phrase> | <compound modifier>
                        <conjunction> <modifying phrase>

<modifying phrase> ::= <preposition> <compound literal> |
                       <preposition> <noun> <modifying phrase>

<compound literal> ::= <literal> | <compound literal> <conjunction>
                       <literal> | <compound literal> <literal>

<abbreviated command> ::= <compound verb> <terminator> |
                          <compound verb> <literal> <terminator>


Fig. 8.6. Backus normal statements describing syntax 
of language. 
<vocabulary word> ::= <verb> | <conjunction> | <noun> | <preposition> |
                      <adjective> | <adverb> | <filler> | <terminator>

<verb> ::= <find verb> | <print verb> | <delete verb> | <save verb> |
           <read verb> | <other verb>

<find verb> ::= count | find | fetch | f | get | g | keep
<print verb> ::= list | print | p
<delete verb> ::= delete
<save verb> ::= dump | save | store
<read verb> ::= read
<other verb> ::= load | return | search | trace | unload | yes | no | skip

<conjunction> ::= and | and not | but not | not | or

<noun> ::= <article noun> | <title noun> | <word noun> | <author noun> |
           <location noun> | <citation noun>

<article noun> ::= art | article | articles | doc | document | documents |
                   id | ids | identification | identifications | paper |
                   papers
<word noun> ::= keyword | keywords | word | words
<author noun> ::= aut | author | authors
<location noun> ::= loc | location | locations
<citation noun> ::= biblio | bibliography | bibliographies | cit |
                    citation | citations | ref | reference | references

<preposition> ::= <article preposition> | <word preposition> |
                  <author preposition> | <location preposition> |
                  <citing preposition> | <cited by preposition> |
                  <set preposition> | <clustering preposition>

<article preposition> ::= of | used by
<word preposition> ::= contain | contains | containing | use | using
<author preposition> ::= by
<location preposition> ::= at
<citing preposition> ::= cite | citing
<cited by preposition> ::= cited by
<set preposition> ::= in
<clustering preposition> ::= related to | related by authors to |
                             related by citations to

<filler> ::= a | all | all of | an | any | any of | are | been | each |
             every | have | is | the | this | these | those | were |
             written
<adjective> ::= first | last | most recent
<adverb> ::= by frequency | for decision
<terminator> ::= . (carriage return)


Fig. 8.7. Backus normal statements describing vocabulary of language. 
<literal> ::= <article literal> | <word literal> | <author literal> |
              <location literal> | <set literal>

<article literal> ::= <journal> <volume> <page>
<word literal> ::= <literal string>
<author literal> ::= <literal string>
<location literal> ::= <literal string>
<set literal> ::= set <integer>

<journal> ::= <journal name> | <alphabetic code> | <numeric code>
<journal name> ::= Phys. Rev. | Physical Review | ... | Physics of Fluids
<alphabetic code> ::= phyrev | phyreb | ... | spjetp
<numeric code> ::= <integer>
<volume> ::= <word> <integer> | <integer>
<page> ::= <word> <integer> | <integer>

<literal string> ::= <word string> | '<word string>'

     (the first word string in this definition cannot include a
     vocabulary word.)

<word string> ::= <word> | <word string> <word>
<word> ::= <character> | <character><character> |
           <character><character><character> | ...
<integer> ::= <digit> | <digit><digit> | <digit><digit><digit> | ...
<character> ::= <letter> | <digit> | <special character>
<letter> ::= a | b | ... | z
<digit> ::= 0 | 1 | ... | 9
<special character> ::= - | / | = | * | : | ; | ...

<word separator> ::= (blank) | , | .


Fig. 8.8. Backus normal description of literals. 
8.23 Equivalence of Descriptions

The equivalence of the Backus normal definition of Sec. 8.22 to
the finite state diagram of Sec. 8.21 can be shown by successively
applying the four transformations of Fig. 8.9 to the statements of
Fig. 8.6. Fig. 8.10 is a brief outline of the steps which would be
taken in this process. One is referred to the literature for an
explanation of the additional concepts (e.g. non-deterministic machines,
equivalent states, etc.) introduced in this Figure.


      Backus Normal         Finite State
(1)   A ::= B | C           (state diagram)
(2)   A ::= B C             (state diagram)
(3)   A ::= a B | c         (state diagram)
(4)   A ::= B a | c         (state diagram)


Fig. 8.9. Rules for transforming Backus normal statements 
to finite state diagram. 


8.3 Interpretive Algorithm 

In this section we will describe how the retrieval system
interprets and processes the language of Sec. 8.2. The discussion will
initially cover some general aspects of requests and of the words that
they contain. Sections 8.32-8.34 will describe the various functions
that requests can perform (the verb), the types of data that can be
generated as output (the direct object), and the structure that
specifies the actual request (the modifying phrase).
[State diagrams for the following steps: expansion of R; expansion of
(CM); reduction to deterministic machine; substitution for (CM), where a
null symbol is necessary for isolation; reduction to deterministic
machine; combination of equivalent states.]


Fig. 8.10. Outline of steps proving equivalence of Backus-normal 
and finite state descriptions. 
8.31 Vocabulary and Literals

A request consists of one or more lines of characters that the user 
types on his time-sharing console. The maximum length of a request is 
currently OO characters. The end of a request is indicated by a period 
followed by a carriage return. The request character string is initially 
broken up into words. Words are defined to be character strings 
separated by blanks, commas, and/or periods. There are two types of 
words: those found in the vocabulary table and those not found in the 
table. All words not found in the table are called literals. Their 
function is to specify the particular authors, title words, citations, 
etc. that the user wishes to designate in defining his request. The 
vocabulary words are for indicating the function and structure of the 
request. 

In some cases a user may want to use one of the words in the
vocabulary table as a literal. For example, he may want to find all
titles that contain the vocabulary word, "store". To do this he can
explicitly specify the word as a literal by the use of the literal mark,
'. For the above example the user would say, "print the titles of
all articles containing 'store'."

Note that the retrieval system makes no distinction between lower
and uppercase letters. The T.I.P. file does not contain information on
whether a letter is lower or upper case either.
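
As a rough illustration of the word-splitting step described in this
section, the sketch below folds case, splits on blanks, commas, and
periods, and treats any word enclosed in literal marks as a literal. The
VOCABULARY table is a small hypothetical subset of Fig. 8.7, the function
name classify is an assumption, and multi-word vocabulary entries such as
"cited by" are deliberately not handled.

```python
import re

# Hypothetical subset of the vocabulary of Fig. 8.7 (word -> input symbol).
VOCABULARY = {
    "print": "V", "find": "V", "count": "V", "store": "V",
    "title": "N", "titles": "N", "article": "N", "articles": "N",
    "of": "P", "by": "P", "containing": "P",
    "and": "C", "or": "C",
    "the": "X", "all": "X",
}


def classify(request):
    """Split a request into (word, class) pairs; the final period becomes E."""
    text = request.strip().lower()  # the system ignores letter case
    if not text.endswith("."):
        raise ValueError("a request must end with a period")
    body = text[:-1]
    pairs = []
    # A string inside literal marks is always a literal; everything else is
    # split on blanks, commas, and periods and looked up in the vocabulary.
    for quoted, bare in re.findall(r"'([^']*)'|([^\s,.']+)", body):
        if quoted:
            pairs.append((quoted, "L"))
        elif bare:
            pairs.append((bare, VOCABULARY.get(bare, "L")))
    pairs.append((".", "E"))
    return pairs


if __name__ == "__main__":
    for word, symbol in classify("Print the titles of all articles containing 'store'."):
        print(symbol, word)
```

Without the literal marks the same word would be classified as a verb,
which is the case the literal mark exists to avoid.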


8.32 Available Functions 
The verb part of each request specifies the particular operation or
operations that are to be performed. For example, if the user wants the
results of the search to be printed on his time-sharing console, he
would use the verb, "print". There are currently twenty-three verbs in
the vocabulary and thirteen different functions that they specify. Let
us describe five of the thirteen functions.

(1) Scratchpad Storage

One of the most useful features of the retrieval system is its
scratchpad storage capability. Basically this involves the storage in
core memory of various kinds of data for later reference. For example,
one can create in scratchpad storage a file of all articles written by a
given author by the command, "Find the articles by John Jones." After
creating the set, the system tells the user its size and identification
number (e.g. articles in set 3). Later on the user could find out
what articles cite articles by John Jones by the request, "Print the
articles citing articles in set 3," or just "p art citing set 3."

Each data set in scratchpad storage is currently homogeneous with 
respect to the type of information it contains. In other words one 
could not create a set that consisted of both author and citation data. 

Some of the verbs that create sets in scratchpad storage are: 
count, find, fetch, f, get, g, and keep. These words are completely 
equivalent so far as the system is concerned. 

(2) Console Print-out 

The verbs that will cause the data in question to be printed on the 
user's console are list, print, and p. A scratchpad set will also be 
automatically created (if the output is homogeneous and if it isn't 
already a set). 

The first line of each print-out consists of the number of items 
that will follow. Thus the user is always aware of the ultimate size of 


the listing and can interrupt it if he wishes. 
(3) Delete Data Sets

Sets or groups of sets can be erased from scratchpad storage by
commands such as "Delete set 4." or "Delete all sets."

(4) Save Data Sets

Any scratchpad data set can be placed on the disc for permanent
storage by the verbs save, store, or dump. The form of the command
would be: "Save set 2."

(5) Read Data Sets

Data sets that have been stored on the disc by the above command
can be written back into scratchpad storage by commands of the type:
"Read set 6."


The functions of some of the verbs can be modified by adverbs or 
adverbial phrases. Let us describe two such modifications that have 
been implemented. 

(1) Frequency Lists 

The print verb can be modified to list items in terms of their 
frequency of occurrence in the data from which they are extracted. For 
example, the command, "Print frequency of title words in Phys. Rev. 

Vol. 132." would produce a list of the number of times each word appears 
in the titles of articles in Phys. Rev. Vol. 132 (most frequent first 
and alphabetical within the same frequency). 
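
The ordering rule for frequency lists (most frequent first, alphabetical
within the same frequency) can be sketched as follows. The function name
frequency_list and the sample words are illustrative assumptions and are
not taken from the T.I.P. file.

```python
from collections import Counter


def frequency_list(items):
    """Return (item, count) pairs, most frequent first, ties alphabetical."""
    counts = Counter(items)
    # Sort on negated count for descending frequency, then on the item itself.
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    title_words = ["probe", "plasma", "probe", "langmuir", "plasma", "probe"]
    for word, count in frequency_list(title_words):
        print(count, word)
```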

(2) Decision Print-outs 

The print verb can also be modified so that there is a pause after
each item is printed out to allow the user to decide upon and respond to
the item. This would be the command used, for example, by a user who
wished to be coupled into the clustering procedure. For the command,
"Print for decision the titles of articles related to Nuovo Cimento
Vol. 30, page 1.", the procedure would pause after printing the title of
each article about to be added to or deleted from the set S and allow
the user to place the article in the Y or Z set if he wished.


8.33 Data Generated 

The second part of the request is the direct object of the verb. 
It is a list of the types of information (nouns) that the user specifies 
he wants in the system's response to the request. Fig. 8.7 indicates 
six different types of nouns that can be used for this purpose (article, 
title, word, author, location, and citation nouns). The correspondence 
of these words to the various types of data found in the T.I.P. file is 
fairly obvious. Any combination of these types of data can be printed 
on the user's console, but only one type can be put in scratchpad 
storage for a given request. The form of the data as it is printed on
the console is shown in Fig. 6.1. The data placed in scratchpad has the
single level structure indicated by Fig. 8.11 (see Sec. 7.1).


[Diagram: a set node linked to a list of author name nodes.]


Fig. 8.11. File structure of data in scratchpad storage. 


8.34 Request Structure

The third and final component of the request is the phrase which
modifies the direct object of the verb. It consists of a series of
prepositional phrases which either modify the direct object itself or
else modify the noun object of one of the other prepositional phrases.
Let us define the structure of this modifying phrase and describe how it
is interpreted.


8.341 Determination of Literal Type 

The object of each preposition can be a noun or a literal. In the 
case of a literal some indication must be given of its type, since there 
is no intrinsic difference between most of the types (e.g. a word 
literal might look exactly like an author literal). The first preposi- 
tion to the left of a literal is currently used to determine the type. 
Fig. 8.12 lists the literal type which is assumed to follow each preposi- 
tion. For example, any word not in the vocabulary that follows the 
preposition, "by", is assumed to be an author's name. 

The one exception to this is the set literal which can be the 
object of any preposition. It is distinguished from other literals, not 
by the preceding preposition, but by the word, "set", at the beginning 
of the literal. 

There is one additional way of indicating the literal type which has
been partially implemented but is not described in Sec. 8.2. This
involves the use of a noun between the preposition and the literal. An
example of this would be the phrase, "with the word, phonon", which is
acceptable and identical to the phrase, "using phonon". A change such as
this would become essential if the number of data types increased
substantially, since there would not be enough suitable prepositions.
Preposition Type           Type of Object

<article preposition>      <article noun>, <citation noun>, <article literal>
<word preposition>         <word noun>, <word literal>
<author preposition>       <author noun>, <author literal>
<location preposition>     <location noun>, <location literal>
<citing preposition>       <article noun>, <citation noun>, <article literal>
<cited by preposition>     <article noun>, <citation noun>, <article literal>
<set preposition>          <set literal>
<clustering preposition>   <article noun>, <citation noun>, <article literal>

Fig. 8.12. Valid types of objects for each preposition class.
           (Set literals are valid objects for any preposition
           and are not listed.)


8.342 Form of Literals 


After the general type of information that a literal contains is
determined, one must next interpret what specifically is meant by each
literal. To this end let us describe the conventions which govern the
form that each type of literal can take.

Article literals generally consist of three parts: the journal, 
volume, and page. The journal can be specified by using the full title, 
the standard abbreviation of the title, or a special alphabetic or 
numeric code. The volume and page number can each consist of an integer 
or a word followed by an integer. Some examples of acceptable article 
literals are: 

Physical Review, volume 128, page 1
Phys. Rev., vol. 128, p. 1
Phyrev v 128 p 1
1 128 1

The volume and page number have been made optional so that one can
refer to all articles in a given journal or in a given volume by a
single literal.

Each word literal should consist of a single word. If one wishes 
to search for a phrase of two or more words, he should use two or more 
literals (e.g. "print titles of articles using thin and film."). 

A word literal represents (matches) not only the word in the file 
which is identical to it, but also all words to which it is the prefix. 
Thus the command, "Get the art using supercon." would get all articles 
with titles containing superconductor, superconductivity, etc. 

If one does not want prefix matching, he can use a "#" to designate
an explicit blank. The command, "p art using laser#.", would not
produce those articles whose titles contain the word, "lasers".

Author literals are to be written with the surname last (e.g.
John H. Jones). A literal that consists of a surname only will retrieve
all authors with that surname. A literal containing one or more given
names will match those author names in the file for which the surname
matches exactly and for which every given name in the literal is the
prefix of the corresponding given name in the file. Thus, "p art by Al
Jones.", would print all articles by "Albert Jones", "Alden Jones",
and "Allen S. Jones".
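
The matching conventions for word and author literals can be sketched as
below. This is a minimal illustration under stated assumptions: the
function names are hypothetical, both the literal and the stored author
name are taken to be written with given names first and the surname last,
and a trailing "#" is taken to be the explicit-blank mark that suppresses
prefix matching.

```python
def word_matches(literal, word):
    """Prefix match of a word literal against a title word."""
    literal, word = literal.lower(), word.lower()
    if literal.endswith("#"):  # explicit blank: exact match only
        return word == literal[:-1]
    return word.startswith(literal)


def author_matches(literal, name):
    """Match an author literal against a full author name.

    The surname must agree exactly; every given name in the literal must
    be a prefix of the corresponding given name in the full name.
    """
    lit = literal.lower().replace(".", " ").split()
    full = name.lower().replace(".", " ").split()
    if not lit or not full or lit[-1] != full[-1]:
        return False
    lit_given, full_given = lit[:-1], full[:-1]
    if len(lit_given) > len(full_given):
        return False
    return all(f.startswith(g) for g, f in zip(lit_given, full_given))


if __name__ == "__main__":
    print(word_matches("supercon", "superconductivity"))
    print(word_matches("laser#", "lasers"))
    print(author_matches("Al Jones", "Allen S. Jones"))
```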

Location literals must be given in a request exactly as they are 
found in the data file if retrieval is to be accomplished. 

Set literals consist of the word, "set", followed by the
identification number of the desired set.
8.343 Action Initiated by Each Preposition

Each prepositional phrase in a request initiates a file search 
(table look-up) in an appropriate data file. If the object of the 
preposition is an author, location, word, or citation literal, then the 
file used is the corresponding inverted file. If the object of the 
phrase is an article literal then the raw data file is used. 

The information obtained from an inverted file is, of course, 
always a list of article identifications. The type of information 
obtained from the raw data file is determined by the type of noun that 
is modified by the prepositional phrase in question. For example, in 
the command, "Print authors of Phys. Rev. 128 1.", the table look-up 
for the "of" preposition would be in the raw data file and would select 
the author information. 

The set of articles (or other data) produced by each table look-up 
can in turn be the object of another preposition and another table look- 
up. Consider the request, "Print the titles of articles cited by 
articles by John Jones." The procedure first looks up the articles by 
John Jones. Then it finds the articles cited by the articles by John 
Jones. And finally it retrieves and prints the titles of the articles 
so obtained. Note that each of the three prepositions, of, (cited) by, 
and by initiated a particular type of file search. 

There are two types of prepositions that do not cause a table
look-up in a file. A clustering preposition performs more than just a
table look-up. The procedure of Chapter V is executed, resulting in the
set of articles of the appropriate cluster.

The set preposition does not initiate a file search but produces
the input set as its output (a unitary transformation). Thus in the
request, "Print the title of articles in set 4.", the preposition, "in",
merely passes on the articles in set 4 to the next preposition, "of",
which looks up their titles.


8.344 Logical Operations 

The results of the table look-ups (or clustering) for two or more
prepositional phrases can be combined by the standard logical operations
(and, or, not). Consider, for example, the request, "Print the articles
by John Jones and by Robert Smith or by Charles White but not by David
Allen." The logical operation performed can be represented by the
expression $((J.J. \cap R.S.) \cup C.W.) \cap \overline{D.A.}$, where
the initials J.J. stand for the set of papers by John Jones and
$\overline{D.A.}$ is the set of papers not written by David Allen. It
will be noted that the logical operations are performed from left to
right through the request in the same sequence in which the user typed
them in. It was thought that this might be a more useful convention for
a system that is closely coupled to the user than to have a
parenthesized system with a hierarchy of the types of operations to
perform first (as in MAD, FORTRAN, etc.).

Any arbitrarily complex logical structure can be obtained by this
kind of approach (without having to use parentheses) if one creates sets
in scratchpad storage. For example the set of articles represented by
the logical expression,
$(J.J. \cap R.S.) \cup (C.W. \cap \overline{D.A.})$, could be created by
the sequence of commands,

Find art by John Jones and by Robert Smith.
3 articles in set 1.

Find art by Charles White but not by David Allen.
1 article in set 2.

Print art in set 1 or in set 2.
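
The left-to-right convention can be sketched with ordinary set
operations. The function name combine and the article identification
numbers below are invented for illustration, and treating a bare "not"
the same as "and not" is an assumption.

```python
def combine(first, steps):
    """Fold (conjunction, set) steps into `first`, strictly left to right."""
    result = set(first)
    for conjunction, operand in steps:
        if conjunction == "and":
            result &= operand
        elif conjunction == "or":
            result |= operand
        elif conjunction in ("and not", "but not", "not"):
            result -= operand  # intersection with the complement
        else:
            raise ValueError(f"unknown conjunction: {conjunction}")
    return result


if __name__ == "__main__":
    # Hypothetical article identification numbers for each author.
    jj, rs, cw, da = {1, 2, 3}, {2, 3, 4}, {5, 6}, {3, 6}

    # "... by John Jones and by Robert Smith or by Charles White
    #  but not by David Allen."
    print(combine(jj, [("and", rs), ("or", cw), ("but not", da)]))

    # The same data grouped differently by way of two scratchpad sets.
    set1 = combine(jj, [("and", rs)])
    set2 = combine(cw, [("but not", da)])
    print(combine(set1, [("or", set2)]))
```

With this sample data the two groupings give different answers, which is
why the scratchpad sets are needed to express the second form.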
There is one logical structure that is not allowed in the system
since it makes little sense in retrieval applications. This is the
negation of any of the operands of the "or" operation. Consider the
command, "Print articles by John Jones or not by Robert Smith." If
this means $J.J. \cup \overline{R.S.}$, then the articles requested
would include most of the file since Robert Smith would have authored at
most 20-30 articles.

The conjunctive operation between each pair of prepositional
phrases must be explicitly stated. One could not say, "Print art by
John Jones, by Robert Smith, and by Charles White." However, one can
omit the prepositions after the first one (e.g. "Print art by John Jones
and Robert Smith.").


8.345 Selection of Predecessor 

The next problem to be considered is the determination of what 
noun(s) each prepositional phrase modifies (its predecessor). Consider 
the request, "Find the articles citing articles by John Jones and cited 
by Physics of Fluids, v. 7, p. 1." The last phrase, "cited by..." can 
conceivably modify either of the two preceding "articles" words. 
However, the answer to the request is markedly different depending on
the interpretation selected. The approach adopted here is to "attach"
each prepositional phrase to the first noun to the left of the phrase
that is a valid type for the preposition in question. In Fig. 8.13 the
valid noun types that can be modified by each preposition are listed.

Note that each preposition that immediately follows a noun and not
a conjunction, must modify that noun and cannot be attached to other
nouns further to the left. If the noun is not valid for the preposition
by Fig. 8.13, then the request is considered in error. The request,
"Find the articles by John Jones and the citations at Harvard University.",
would not be valid because the preposition, "at", is not a valid modifier
of "citations" and cannot be attached to the earlier "articles" word
because it does not immediately follow a conjunction.
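
The attachment rule can be sketched as a leftward scan over the
classified words of a request. This is an illustration only: the token
representation, the names MODIFIABLE and find_predecessor, and the removal
of filler words beforehand are assumptions, and the table of valid noun
types is a reading of Fig. 8.13 as printed.

```python
ALL_NOUNS = {"article", "title", "word", "author", "location", "citation"}
DOCUMENT_NOUNS = {"article", "citation"}

# Noun types that each preposition class can modify (after Fig. 8.13).
MODIFIABLE = {
    "article": ALL_NOUNS,
    "word": DOCUMENT_NOUNS,
    "author": DOCUMENT_NOUNS,
    "location": DOCUMENT_NOUNS,
    "citing": DOCUMENT_NOUNS,
    "cited by": DOCUMENT_NOUNS,
    "set": ALL_NOUNS,
    "clustering": DOCUMENT_NOUNS,
}


def find_predecessor(tokens, index):
    """Return the position of the noun modified by the preposition at `index`.

    `tokens` is a list of (kind, type) pairs, where kind is "N" (noun),
    "P" (preposition), "C" (conjunction), or "L" (literal).
    """
    kind, prep_type = tokens[index]
    if kind != "P":
        raise ValueError("token at index is not a preposition")
    valid = MODIFIABLE[prep_type]
    # A preposition directly after a noun must modify that noun.
    if index > 0 and tokens[index - 1][0] == "N":
        if tokens[index - 1][1] not in valid:
            raise ValueError("preposition cannot modify the noun it follows")
        return index - 1
    # Otherwise attach to the first valid noun found scanning to the left.
    for i in range(index - 1, -1, -1):
        k, noun_type = tokens[i]
        if k == "N" and noun_type in valid:
            return i
    raise ValueError("no valid noun to the left of the preposition")


if __name__ == "__main__":
    # "Find the articles citing articles by John Jones and cited by
    #  Physics of Fluids, v. 7, p. 1."  (verb and filler words removed)
    request = [
        ("N", "article"), ("P", "citing"), ("N", "article"), ("P", "author"),
        ("L", "john jones"), ("C", "and"), ("P", "cited by"),
        ("L", "physics of fluids 7 1"),
    ]
    print(find_predecessor(request, 6))
```

Under this reading, the phrase "cited by ..." in the example attaches to
the second "articles" word, the nearest valid noun to its left.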


Modifiable Noun Types              Preposition Type

<noun>                             <article preposition>
<article noun>, <citation noun>    <word preposition>
<article noun>, <citation noun>    <author preposition>
<article noun>, <citation noun>    <location preposition>
<article noun>, <citation noun>    <citing preposition>
<article noun>, <citation noun>    <cited by preposition>
<noun>                             <set preposition>
<article noun>, <citation noun>    <clustering preposition>


Fig. 8.13. Types of nouns that each class of prepositions 
can modify. 


8.346 Interpretation of Adjectives 

Let us make two final comments concerning the interpretation of the
language. Filler words are adjectives, adverbs and certain other words
that initiate no action in the interpreter. They are effectively ignored.
Their only use is to make the statement of the request more smooth and
natural.

There are other adjectives and adverbs that do affect the
interpreter, however. Some of them are listed in Fig. 8.7. A large
number of adjectives and adverbs come to mind that would be very useful
if implemented. However, only enough of them were made part of the
experimental system so that the possibility of their use in the language
could be tested.


PART FOUR: RESULTS AND CONCLUSIONS 


Part Two introduced a theoretical model for a 
document retrieval system. The experimental system 
developed to test the model in a realistic environ- 
ment was described in Part Three. In this part we 
present the experimental results obtained with the
system and the conclusions about the model that can 
be drawn from them. 


This final part is divided into two chapters. 
Chapter IX: Experimental Results 


Chapter X: Conclusions 
CHAPTER IX
EXPERIMENTAL RESULTS 


In the first section of this chapter some data on the general 
characteristics of clusters will be presented. Then some specific 
examples will be given illustrating the composition of clusters in 
terms of the frequency of occurrence of title words, authors, and 
citations of the included articles. 

In the next two sections clusters will be compared with some 
existing sets of documents which have already been judged to be 
mutually pertinent. Three bibliographies found in review articles that 
are not part of the T.I.P. file and two subject categories compiled by 
indexers will be used for this purpose. 

Finally, the results of two tests will be presented in which
clusters were evaluated by representative users of the document file.


9.1 Cluster Parameters


Before attacking the problem of whether or not clusters contain 
sets of documents that are mutually interesting to users, it may be 
appropriate to first summarize some of the more general features of 
clusters. This section will, accordingly, present statistics on certain 
cluster parameters. 

The data from which the statistics are drawn come from the tests of
Sec.'s 9.3 to 9.5. They are, of course, a function of the particular
requests presented to the system during the tests and of the composition
of the T.I.P. file at the time. It was thought, however, that these
statistics would serve as an introduction to the experimental results.

The first parameter that will be described is cluster size. Fig.
9.1 shows the distribution by size of some different clusters generated
by the procedure. The largest cluster found so far contains 159
documents, while the smallest contains only one document.


[Histogram: number of clusters versus cluster size, in bins of 1-20,
21-40, 41-60, 61-80, 81-100, 101-120, and 121-up documents.]

Fig. 9.1. Distribution of cluster size for 190 clusters.


One of the important features of the clustering procedure as
described in Chapter V is its ability to adjust the size of the answer
to fit the request. This is accomplished by applying a bias to the
links of the document network (See Sec. .). About 82% of the clusters
examined utilized either a positive or negative bias with the other 18%
having no (zero) bias.

In Fig. 9.2 the distribution of clusters for various ranges of bias
is shown. Fig. 9.3 indicates that the average cluster size increases
monotonically as the bias increases. This curve seems to follow the
equation y^2 = 80(x-12), where y is the cluster size and x is the bias.
We will not attempt to explain why this is the case here.
[Histogram: number of clusters versus bias range, in bins of 0-20,
20-40, 40-60, 60-80, and 80-100 bits.]

Fig. 9.2. Distribution of clusters by bias for 275 clusters.


[Plot: average cluster size versus bias, with the bias axis running
from -20 to 100 bits.]

Fig. 9.3. Plot of average cluster size versus bias for 340 clusters.


Another characteristic of the procedure that can be studied is the
way documents are deleted from the set (S) that is being formed. The
formation of 37 clusters was observed. It was found that an average of
three documents were deleted per cluster. This resulted in an average
deletion of one document in every 15 iterations. It was also found that
about 90% of the documents that were deleted from S were added to S
again at some later time during the clustering.

Let us next ask when during the clustering process deletions occur.
Fig. 9.4 indicates that deletions are more likely to occur toward the
end of the clustering process.

[Histogram: percent of deleted documents in each quartile, plotted
against the fraction of iterations performed (0-1/4, 1/4-1/2, 1/2-3/4,
3/4-1).]

Fig. 9.4. Percent of deletions occurring in each quartile of
the clustering process.
(average for 75 clusters)
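
The quartile tabulation behind Fig. 9.4 can be illustrated with a short sketch. It assumes that iterations are numbered from 1 and that the iteration at which each deletion occurred was recorded; the function name and the sample run are hypothetical and are not taken from the experimental system.

```python
import math


def deletion_quartiles(deletion_iterations, total_iterations):
    """Percent of deletions that fall in each quartile of a clustering run."""
    counts = [0, 0, 0, 0]
    for iteration in deletion_iterations:
        # Iterations are numbered 1..total_iterations, so the fraction
        # completed lies in (0, 1]; ceil maps it to quartile 1..4.
        quartile = math.ceil(4 * iteration / total_iterations)
        counts[quartile - 1] += 1
    total = sum(counts)
    if total == 0:
        return [0.0, 0.0, 0.0, 0.0]
    return [100.0 * c / total for c in counts]


# Hypothetical run: 45 iterations with deletions at iterations 20, 38 and 44.
percents = deletion_quartiles([20, 38, 44], 45)
print(percents)
```

Averaging such per-cluster lists over many clusters would give a figure of the kind shown above.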


In the final portion of this section we will describe the way the 
procedure responds to requests that are inconsistent or ambiguous. A 
specific example, (Cluster Ay of Sec. 9.33) is used for this purpose. 
The first test consisted of holding the pertinent (Y) set of the request
constant and successively placing every other member of Cluster A
in the non-pertinent (Z) set (Y=a_1; Z=a_i, i=1,...,n). The results are
shown in Fig. 9.5 and 9.6.

There are three basic types of responses that resulted. In seven
cases the size of the cluster was reduced. This was, in general, what
happened when the document specified as not pertinent had a smaller bias
to A than a_1 did. In eight other cases the procedure was found to
select another cluster (B, D, or E) containing some documents that were
not part of the original cluster. In the remaining twelve cases the
request was judged to be inconsistent. A careful examination of the
network revealed that in each of the twelve cases there was at least
one cluster which could have satisfied the request. The reasons why
the procedure was not able to locate a valid answer cluster in these
cases have already been discussed in Sec. 5.51.

Fig.'s 9.5 and 9.6 illustrate two types of request ambiguity. The
first type is hierarchal in nature, involving clusters that are subsets
of larger clusters. Take, for example, the request Y=a_1; Z=a_16. It
can be satisfied not only by the cluster listed for it in Fig. 9.5, but
also by the smaller clusters listed for a_7, a_10, and a_20. The second
type of ambiguity is due to the fact that clusters overlap. Thus the
clusters B, D, or E also satisfy the request Y=a_1; Z=a_16.

A second test was conducted in order to further study the extent of
the second type of ambiguity. In this test a given document was
specified as pertinent and a cluster was found. The document which had the
highest correlation to the cluster found was then specified as
non-pertinent and another search was conducted. If a second cluster was
found then the document with the highest correlation to the new cluster
was added to Z and the process was continued. At some point the request
became inconsistent.

The results of this type of test on six articles are given in
Fig. 9.7. Note that document a_1 of Fig. 9.5 would result in the test
pattern of Example 4 since a_23 is most highly correlated to A and the
answer to the request (Y=a_1; Z=a_23) is inconsistent.


Articles in    Bias of        Rank by bias       Answer to the Request:
Cluster (A)    a_i to A       (largest first)    Y=a_1; Z=a_i

  a_1          114.9 bits         20             Inconsistent
  a_2          132.7               5             B
  a_3          121.0              15             Inconsistent
  a_4          130.3               8             Inconsistent
  a_5          103.2              26             A∩(...)
  a_6          118.4              16             B
  a_7          116.3              17             A∩(...)
  a_8          131.9               6             Inconsistent
  a_9          123.2              13             Inconsistent
  a_10         109.8              23             A∩(...)
  a_11         127.4               9             Inconsistent
  a_12         104.6              25             A∩(...)
  a_13         136.6               4             Inconsistent
  a_14         126.1              11             Inconsistent
  a_15         110.4              22             D
  a_16         102.8              27             A∩(...)
  a_17         122.0              14             B
  a_18         106.6              24             A∩(...)
  a_19         116.2              18             E
  a_20         112.3              21             A∩(...)
  a_21         146.4               2             E
  a_22         124.1              12             Inconsistent
  a_23         155.6               1             Inconsistent
  a_24         141.8               3             Inconsistent
  a_25         115.               19             E
  a_26         130.4               7             Inconsistent
  a_27         127.0              10             E



B=(a, 838) 0839259) plus 12 other articles 


De(a,8,8),8¢2) 7259) plus 11 other articles 


E=(2,8,855 


) plus 20 other articles 


Fig. 9.5. Example of clusters which result when documents 
are specified as non-pertinent. 


Fig. 9.6. Diagram of relationship of clusters of Fig. 9.5.
(Each circle represents a cluster)


Example    Size of successive answer clusters
   1       31, 22, 27, inconsistent
   2       17, 125, 4, 2, inconsistent
   3       22, 36, 23, 23, inconsistent
   4       27, inconsistent
   5       33, 27, inconsistent
   6       39, 33, 14, inconsistent


Fig. 9.7. Test of request ambiguity.
9.2 Cluster Composition

In the last section statistics on some of the more general features 
of clusters such as size and bias were presented. In this section the 
composition of clusters will be described in terms of data available 
in the T.I.P. file. In particular, examples will be given of the 
composition of clusters in terms of the title words, authors, and 
citations of the included articles. 

In Fig. 9.8 we list in order of frequency of occurrence the title 
words for six clusters. Note that the common "function" words (in, of, 
the, and, on, etc.) have been omitted from all of the lists except for 
Example A. Also the lists have been truncated to include only the words 
that occurred most often in the titles. The full titles of Example B 
are shown in Fig. 9.16. 

In none of the cases studied did the title of every article in a
cluster contain the same word. For Fig. 9.8 the word that comes closest
to occurring in every title is "plasma" of Example D, which occurs in
18/22=82% of the titles. If one were to group together words of
equivalent meaning, then "superconducting" and "superconductors" in
Example A would be highest with 27/31=87%.
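
As an illustration of this kind of tabulation, the sketch below counts, for each word, the number of titles in which it appears and omits common function words. The word list, the punctuation handling and the function name are assumptions made for the example; the four sample titles are taken from Fig. 9.16.

```python
from collections import Counter

# Assumed list of "function" words to omit; the original list is not given.
FUNCTION_WORDS = {"in", "of", "the", "and", "on", "a", "an", "by", "for", "to"}


def title_word_counts(titles, omit=FUNCTION_WORDS):
    """Count, for each word, the number of titles in which it appears."""
    counts = Counter()
    for title in titles:
        words = {w.strip(".,;:()").lower() for w in title.split()}
        words.discard("")
        counts.update(words - omit)
    return counts


sample_titles = [
    "Generation of spin waves in nonuniform magnetic fields I.",
    "Magneto-elastic waves in yttrium iron garnet",
    "Demagnetizing field in nonellipsoidal bodies",
    "Anisotropic spin-wave propagation in ferrites",
]
counts = title_word_counts(sample_titles)
share_waves = counts["waves"] / len(sample_titles)
print(counts.most_common(3), share_waves)
```

Grouping words of equivalent meaning, as suggested above for "superconducting" and "superconductors", would require an additional mapping that is not attempted here.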

In Fig. 9.9 some similar examples are given for the authors of the 
articles in clusters. In Example A it was found that E. Schlomann is 
the author of two other papers in the T.I.P. file (in addition to the 
four listed), R. I. Joseph of one other, and W. Strauss of two others. 

In Fig. 9.10 citation counts are given for the same three clusters
that were used in Fig. 9.9. In Example A there is one citation which
is found in all of the articles in the cluster. In Example B, 46/64=72%
of the articles cite the same paper, while only 10/35=28% do in Example
Example A
Cluster of Sec. 9.33.
31 articles
99 words
    22  in
    22  superconducting
    19  of
    13  ultrasonic
    10  energy
    10  gap
     9  the
     8  attenuation
     5  and
     5  superconductors
     5  tin
     4  by
     4  determination
     4  waves
     3  (11 words)
     2  (16 words)
     1  (58 words)

Example B
Cluster A_1 of Sec. 9.31.
12 articles
66 words
    Words in the order listed: waves, spin, garnet, iron, magnetic,
    magneto-elastic, microwave, nonuniform, propagation, yttrium,
    crystal

Example C
Cluster of Sec. 9.33.
22 articles
75 words
    12  quantum
    11  oscillations
     8  ultrasonic
     6  attenuation
     6  field
     6  giant
     6  metals
     5  effect
     4  magnetic
     4  magnetoacoustic
     3  absorption
     3  sound
     2  alphen

Example D
Cluster of Sec. 9.52.
22 articles
8 words
    18  plasma
     9  turbulent
     8  waves
     5  particles
     4  electromagnetic
     4  turbulence
     3  charged

Example E
40 articles
154 words
    20  plasma
    17  probe
    11  langmuir
    Remaining words in the order listed: probes, characteristics,
    field, magnetic, electrostatic, resonance, studies, double

Example F
Cluster for article 8 of Fig. 9.11
22 articles
81 words
    16  optical
     7  generation
     7  harmonic
     6  nonlinear
     5  theory
     3  second


Fig. 9.8. Title-word frequency counts for six clusters.
(The number to the left of each word is the number
of times it occurs in the titles of the cluster.)


Example A 
Cluster AL of 
Sec. 9.31. 

12 articles 
13 authors 


Joseph R. I. 
Damon R. W. 
Strauss W. 


PMN MW 


(8 authors ) 


Fig. 9.9. 


Example A 


Cluster A, of 
Sec. 9.31. 
12 articles 


Schlomann Ernst 


Van De Vaart H. 


Example B 
Cluster A, of 
Sec. 9.32. 

64 articles 
75 authors 


7 Spector Harold N. 
k Prohofsky E. W. 


3 Gurevich V. L. 
3 Kroger Harry 


3 Pustovoit V. I. 


2 (8 authors) 
1 (62 authors) 


Example B 
Cluster A, of 
Sec. 9.32. 

6 articles 


35 


PUNY NNONNNWWWEAAZD 


citations 


11-34-1298 
41-8 -357 
11-35-159 
11-35-167 
1-105~390 
1-120-200), 
11-35-1022 
1-125-1950 
11-31-1647 
11-35-2382 
11-35-2382 
11-36-875 
41-6-620 
1-12 -583 
708 -19-308 
(21 citations) 


369 citations 


BNW EULOAnNO Oo 


41-7 -237 
11-33-2457 
41-9-87 
11-33-ho 
11-34-1548 
h1+9-296 

1-127 -108) 
1-126-1974 

4 -8-) 

41-4 -505 
1-13-1302 
28-8 -161 

(4 citations ) 
(7 citations ) 
(12 citations) 
(12 citations) 
(18 citations) 
(49 citations) 
(262 citations) 
Example C


Cluster Ag of 
Sec. 9.52 

35 articles 
38 authors 


7 Kraichnan Robert H. 

2 Deissler Robert G. 

2 Eschenroeder Allan Q. 
1 (35 authors) 


Author frequency counts for three clusters. 


Example Cc 
Cluster of 
Sec. aoe 

35 articles 
195 citations 


802 -5-L.97 
227-2 -12h 

8 -30-301 

7997 -1030 
802-12 -2h2 

802 -13-369 

802 -16-33 

(3 citations) 
(13 citations) 
(33 citations) 
(139 citations) 


| oa 
Mw EUUUNULNIAO 


Fig. 9.10. Citation frequency counts for three clusters. 
C. Example C is an illustration of an area where the articles
do not all cite one central paper and yet through the use of a large
positive bias they can be pulled together into a cluster.

The papers listed in Fig. 9.10 are identified by three numbers:
the journal code (see Fig. 6.3), volume, and page number. Thus
1-136-441 is the paper beginning on page 441 in volume 136 of the
Physical Review.
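
A minimal sketch of how such an identifier can be split into its three parts is given below. The journal names attached to codes 11, 41 and 646 are inferred here by matching entries of the figures against the titles printed in Fig. 9.16; the full code table is in Fig. 6.3, which is not reproduced in this chapter, so the small dictionary should be read as an assumption. The function and class names are hypothetical.

```python
from typing import NamedTuple

# Partial, inferred mapping of journal codes (assumption; see Fig. 6.3).
JOURNAL_NAMES = {
    1: "Physical Review",
    11: "Journal of Applied Physics",
    41: "Physical Review Letters",
    646: "Applied Physics Letters",
}


class PaperId(NamedTuple):
    journal_code: int
    volume: int
    page: int


def parse_paper_id(text):
    """Split 'code-volume-page' into its three integer parts."""
    code, volume, page = (int(part) for part in text.split("-"))
    return PaperId(code, volume, page)


def describe(paper):
    name = JOURNAL_NAMES.get(paper.journal_code,
                             f"journal code {paper.journal_code}")
    return f"{name}, volume {paper.volume}, page {paper.page}"


paper = parse_paper_id("11-35-159")
print(describe(paper))
```

For the identifier 11-35-159 this yields the same journal, volume and page as the first entry of Fig. 9.16.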


9.3 Comparison to Bibliographies 


The next test will be to compare the bibliographies found in certain 
papers with clusters formed by the procedure. Consider, for example, a 
paper with 20 citations. It would be of interest to know if a cluster 
can be formed which includes most, if not all, of the 20 citations. 

For this purpose three articles were selected from the special
October 1965 issue of the IEEE Proceedings on ultrasonics. It was
decided that using articles which are not part of the T.I.P. file would
ensure some degree of independence between the data base and the
evaluation criteria. The IEEE Proceedings represented a journal which is
closely related to the T.I.P. physics file and yet is not actually part
of the file. Since the T.I.P. file covers only the last three years, a
recent issue of the IEEE Proceedings was needed if a suitable fraction of
the bibliographies of the evaluating papers were to be found in the T.I.P.
file.

Of the twenty-seven articles in the October IEEE Proceedings, only
ten cite ten or more articles in the T.I.P. file. Fig. 9.11 tabulates
these ten papers. For the three articles to be used in evaluating the
clustering procedure we selected the two papers with the highest percent
of their bibliographies in the T.I.P. file (1 and 2) and the paper with
the most references to the T.I.P. file (7).


                                      Citations    Percent of
Articles in Proc.        Total        to T.I.P.    Bibliography
IEEE Vol. 53             Citations    file         in T.I.P. file

 1. pp. 1495-1507           22           10           46 %
 2. pp. 1452-1464           38           16           42
 3. pp. 1517-1533           58           22           38
 4. pp. 1438-1451           86           32           37
 5. pp. 1508-1517           47           17           36
 6. pp. 1320-1336           33           11           33
 7. pp. 1586-1603          128           36           28
 8. pp. 1604-1623           67           18           27
 9. pp. 1387-1399           56           13           23
10. pp. 1547-1573          101           15           15


Fig. 9.11. Articles in the October 1965 Issue of the IEEE 
Proceedings that have 10 or more references to 
the T.I.P. file. 
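
The selection rule described above (the two papers with the highest fraction of their bibliography in the T.I.P. file, plus the paper with the most references to the file) can be checked against the figures of Fig. 9.11 with a short sketch. The tuple layout and function name are illustrative assumptions; the numbers are transcribed from the table.

```python
# (article number, total citations, citations to the T.I.P. file)
ARTICLES = [
    (1, 22, 10), (2, 38, 16), (3, 58, 22), (4, 86, 32), (5, 47, 17),
    (6, 33, 11), (7, 128, 36), (8, 67, 18), (9, 56, 13), (10, 101, 15),
]


def select_for_evaluation(articles):
    """Pick the top two by fraction in the file, plus the one with most hits."""
    by_fraction = sorted(articles, key=lambda a: a[2] / a[1], reverse=True)
    chosen = [a[0] for a in by_fraction[:2]]
    most_references = max(articles, key=lambda a: a[2])[0]
    if most_references not in chosen:
        chosen.append(most_references)
    return chosen


selected = select_for_evaluation(ARTICLES)
print(selected)
```

With the tabulated values the rule picks articles 1, 2 and 7, in agreement with the choice stated in the text.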


9.31 Bibliography 1 (IEEE Proc., v. 53, p. 1495)
From Fig. 9.11 we note that the article beginning on page 1495
has 22 citations, 10 of which are to articles in the T.I.P. file.
Fig. 9.12 lists the 10 articles as set B and also lists some other
sets of papers that will be found useful in the discussion that
follows. The ith document in set B will be referred to as b_i, etc.
The answer clusters obtained by the procedure for 18 different
requests are tabulated in Fig. 9.13. The symbol A[Y(b_i)Z(b_j)] stands
for the answer cluster with b_i specified as interesting and b_j
specified as not interesting (i.e. Y=(b_i), Z=(b_j)).
1-136-4h2 11-36-3h53 1-129-991 
11-35-159 64.6-5-176 1-130-L39 
11-35-167 1-134-172 
11-35-1022 1-134-407 
oe F 1-136-1657 
11-36-12h3 1-137 -182 
11-36-1267 ees 11-34-1629 
11-36-1579 1-12 -328 11-34-2639 
1-12 -583 646-6-18 11-36-2387 
646-5 -33 es 11-36-3102 
41-11-69 
G ee 
1-14-2 
D 1-130-6h7 49-1)-129 
— 11-35 -836 310-7 -1892 
11-36-12h5 11-35-993 eae 
11-36-3h02 11-36-661 pa ny " 
11-36-18h5 es Brees 
9-18-2 35 
790-8 -59h 


Fig. 9.12. The sets of articles included in the 
clusters for Bibliography 1. 


Answers to Selected Requests: 
ALY(o, )]=a, for i=2...5,7,8,10 


ALY (b, ) ]=A), 
A[Y(bg) ]=A, 


ALY (9) ]=A, 


Definitions of Clusters: 
A,=(b, ++ -besb7 bg 50), )YDUE 


A =A, U(>, UF 


ALY(o,),A(m) ) J=A, 

ALY (bg) ,2(n)),) J=A, 

ALY(b,b,)]=A, 

AL[Y(b,b, )J=A, UF plus 5 members of H 
and 50 other articles 


ALY(b,+- +b, 9) ]=A, 


ALY(b, ---b, =A, UA; 


A,=(by )UEUE 
A =(b, UG 


Fig. 9.13. List of the answer clusters formed for Bibliography 1. 
In Fig. 9.14 the probable answers for requests consisting of other
combinations of b's are suggested. Not all of the requests listed in this
figure have actually been tested, but experience with the clustering
procedure and the results of Fig. 9.13 make it appear reasonably safe
to assume that the conclusions are correct.


ALY(b,d, ) =A, for 1,j#2...5,7...10 (ifj) 
ALY(b¢b, )]=A, for i=2...10 
A[Y(b,b, )]= (large set of 70-100 articles) for i=2...10 


AL¥(b)Z(n, )]=A, for i=1...18 


ALy(Any combination of by seedy sby ++ +d 4) ]=A, 
ALY(b¢ plus any combination of by++-Byg) I-A, 


ALY(b, plus any combination of other b‘'s)=(large set of 70-100 articles) 


Fig. 9.14. Generalizations suggested by the results of Fig. 9.13.


A diagram showing the amount of overlap of the various answer
clusters is shown in Fig. 9.15.


Fig. 9.15. Sketch showing the relationship of the 
answer clusters of Bibliography 1. 
Some comments will now be made concerning the results given in
Fig.'s 9.12 - 9.15. When the request consists of a single member of
the bibliography, the same answer results in 7 out of 10 cases. This
cluster, A_1, contains 8 of the 10 articles in the bibliography (b_1 and
b_6 are omitted).
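
Comparisons of this kind reduce to simple set operations, as the following sketch suggests. The labels are placeholders: the bibliography is written as b1 to b10, the cluster is assumed to omit b1 and b6, and the four non-bibliography members are given the hypothetical names d1 to d4.

```python
def coverage(bibliography, cluster):
    """Return (found, total, fraction, extras) for a cluster and a bibliography."""
    found = bibliography & cluster
    extras = cluster - bibliography
    return len(found), len(bibliography), len(found) / len(bibliography), extras


bibliography_1 = {f"b{i}" for i in range(1, 11)}
# Assumed membership: eight bibliography articles plus four other articles.
cluster_a1 = (bibliography_1 - {"b1", "b6"}) | {"d1", "d2", "d3", "d4"}

found, total, fraction, extras = coverage(bibliography_1, cluster_a1)
print(found, total, fraction, sorted(extras))
```

The same function applies unchanged to the other bibliographies and to the subject categories discussed later in the chapter.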

The article b_9 is included in A_1 but does not result in A_1 when
used as a request. It results in an almost completely different set of
documents (A_3) which contains only one member of the bibliography. The
request Y(b_9) is, therefore, ambiguous with either A_1 or A_3 being a
valid answer. To resolve the ambiguity various documents from the set
H were placed in the non-pertinent set Z. This shifted the answer from
A_3 to A_1. It was found that the ambiguity could also be resolved by
placing an additional document in the Y set. Thus a request pairing b_9
with a second member of A_1 also resulted in the answer A_1.

The cluster A_2 exemplifies another type of ambiguity. The set A_1
is a subset of the set A_2 and thus the requests Y(b_i) where i=2...5,7,
8,10, could be satisfied by either A_1 or A_2. The request Y(b_6) can
only be satisfied by A_2, however, since b_6 is not included in A_1. Thus
the article b_6 is slightly "beyond" the cluster A_1 and if used in the Y
set of the request results in the more general cluster A_2 of 17 documents
instead of the cluster A_1 of 12 documents. Note that both requests of
the form Y(b_6 b_i) with i=2...10 and the larger request Y(b_2...b_10)
result in the cluster A_2.

The only article from Bibliography 1 which is not included in A_2
is b_1. The request Y(b_1) results in the cluster A_4 which is disjoint
from any of the clusters discussed so far. When requests of the form
Y(b_1 b_i), i=2...10, are used, very large clusters result including most
of the documents listed in Fig. 9.12 and many more. A check of the
paper from which Bibliography 1 was taken reveals that b_1 is cited
only as a source for the values of some constants. It is suggested
that this may be the reason it does not fit into the closely-related
cluster A_2 which includes the other nine papers.

One final observation will be made. There are four articles in
A_1 and nine in A_2 that are not part of the original bibliography.
The question of whether these papers constitute valid additions to the
bibliography will be discussed in Chapter X. Let us at this point,
however, present the titles of the papers in A_1 (Fig. 9.16) as an
illustration of the type of additional articles included in the
clusters.


9.32 Bibliography 2 (IEEE Proc., v. 53, p. 1452)

In Fig.'s 9.17 - 9.20 we present the same data for Bibliography 2
that were given for Bibliography 1. Here again a large majority of
the documents (11 of 16) in the bibliography lead to the same cluster
(A_1) when specified as interesting in the request.

From Fig. 9.20 we observe that clusters A_1 ... A_4 form a hierarchal
series of increasingly larger sets with each new set including the
previous set. The set A_4 contains 14 of 16 members of the bibliography
and 50 other documents. The set A_1 is the only set in the series that
has 0 bias. The series can, of course, be extended to sets which are
larger than A_4 or to subsets of A_1 by additional changes in the bias.



There are two members of the bibliography (b_6 and b_13) that do not
fit into the pattern set by the other 14 members. The article b_6 has
no positive connection to any other paper (i.e. none of the papers it
Print the titles of the articles related to J Appl Phys v. 35 p. 159.
12 documents in set l. 


Journal of Applied Physics, Volume 35, page 159. 
Generation of spin waves in nonuniform magnetic fields I. 
Conversion of electromagnetic power into spin-wave power and 
vice versa. 


Page 167 
Generation of spin waves in nonuniform magnetic fields II. 
Calculation of coupling strength 


Page 1022 
Magneto-elastic waves in yttrium iron garnet 


Volume 36, page 118 
Magneto-elastic waves in yttrium iron garnet 


*Page 1245
Electronically variable delay of microwave pulses in 
single-crystal YIG rods 


Page 1267 
Microwave magneto-elastic resonances in a nonuniform magnetic 
field 


Page 1579 
Demagnetizing field in nonellipsoidal bodies 


*Page 3402
Anisotropic spin-wave propagation in ferrites 


*Page 3453 
Propagation of magnetostatic spin waves at microwave 
frequencies in a normally-magnetized disc 


Physical Review Letters, Volume 12, page 583 
Dispersion of long-wavelength spin waves from pulse-echo 
experiments 1 


Applied Physics Letters, Volume 5, page 33 
Propagation, dispersion, and attenuation of backward-traveling 
magneto-elastic waves in YIG 


*Page 176 
Wall effects in single-crystal spheres of Yttrium iron garnet 
(YIG) 


End. 9.6 sec. used. 
Fig. 9.16. Titles of articles in the A_1 cluster.
(The four * articles were not part of the
original bibliography.)


B 
1-13-1302 
1-135-1761 
1-136-772 
1-136-1731 
1-138-1721 
11-35-125 
11-36-528 
41-11-26 
41-12-h7 
4i-12-555 
1-13 -43h 
41-14 -372 
646-h -82 
64.6-.-190 
646--212 
146-6-81 


Fig. 9.17. 


Answers to Selected Requests: 
ATY(b, JI=a; i=1,253,2,7,8,5, 


A[Y(d, ,) ]*A, 


ALY(b, ) =a, 
ALY(b)<) ]=A 


A[Y(b,)]=(b,) 


ALY(b, ,)]=A 


AL¥(b,») )]=A, 


D 


1-129-1009 


1-130-910 
1-131-1087 
1-131-2512 
1-132-522 
1-132-679 
1-134-507 
1-135-1388 
1-137 -311 
1-138-1250 
1-139-199 
3-81-130 
11-35-137 
11-35-1183 
11-36-3728 
21-31-1700 
29-30-19 
29-31-957 
41-13-308 
43-37-55 
hg-4-45 


D (Con't.) 
9-4-19 
49-13-285 
9-17-14 
80-19-674 
80-20-1131 
80-30-1L2h 
80-20-1617 
80-20-1946 
80-20-2160 
310-5-1818 
310-7 -688 
384-32 -100 
612 -3-18 
612-3-698 
669-16-383 
669-16-1612 
669-19-2h2 
669-19-1407 
669-12 -1113 
821-2-149 


E 
Li-IL-706 
310-6-2233 


F 
669-17 -1h32 


G 
T-136-869 
hl-12-2h1 
49-19-268 
310-6-2)73 
646-7 -hiS 
646-7 -82 


H 
1-130-919 
1-131-95 
1-131-1469 
1-133-183 
1-133-1493 
1-134-728 
1-13-1313 
1-13-1429 


BK (Con't.) 
1-135-51 
1-135-1662 
1-137-801 
1-137 -1305 
1-138-53h 
1-138-1559 
1-139-539 
1-10-2110 
1-142 -126 
3-82-01 
3-86-709 
11-36-22 
11-36-3281 
12-39-1493 
21-30-1717 
21-30-1817 
4i-11-14 
41-11-16 
80-20-363 
669-21-103) 
821-2 -1h1 


The sets of articles included in the clusters 
for Bibliography 2. 


11,12 ,14,16 


Definitions of Clusters: 


B,=(b bab :bebybgbyb, 1b, 9b, ) by ¢) 


By"B, U6 
BB, UD, 
B,=B3 YPa5 


Fig. 9.18. 


ALY(b, 5b, 6) =A, 
ALY (bd) -) =A), 
ALY (bb, 3) J=A, Yb, ,U(29 others ) 
ALY (b,...bpb7. + 6b, 503) ++ -b 6) J=Ay, 
AL¥(b, |, )2(b3 )=Ax (bob 7B BB) Gh) oP 3) 
ALY(b,) )2(b,b, ,)]=(ogbgb 101) NU 

(4464649045 ),4954),) 


A,=B, UD 


A,B, DUE 
A3=B,U DUE UF 
A,B), U DUEUFUG 
Age(03b, 344) 


List of the answer clusters formed for 
Bibliography 2. 
AL[Y(b,b ply for b, vB,

ALY(b, 4b, )]=A, for b,CB, 

ALY(b,, )]=a, for b,CB, 

ALY(b) <b, )]=A), for v,CB, 

A[Y(byb, )]= Inconsistent (b¢ is not linked to any other paper.) 
ALY(b, 3b, )]=A, Ub, (29 others) for vB, 

ALY(x, )]=A, for XCB, 

AL[Y(b, (X, )]=A, for XCB, 

ALY(b) xX, )J=A3 for XB, 

ALY (1 -X;)]=A,, for X; B, 


Fig. 9.19. Generalizations suggested by the results of Fig. 9.18.


Fig. 9.20. Relationship of answer clusters of Bibliography 2. 
cites are cited by other papers) and is thus isolated from the rest of
the file. Article b_13 can be included in a cluster with the rest of
the papers if the bias is made large enough. A cluster formed with b_13
and one other member of the bibliography in the Y set contains, for
example, all of the bibliography except b_6.

There is one significant characteristic that the five papers not
included in A_1 have. They all have relatively few citations. Articles
b_6 and b_13 have only two citations each. Articles b_10 and b_15 have
only three. Article b_5 has seven. In contrast the bibliography
articles in A_1 all have seven or more citations except two, which
have five each. It is suggested that perhaps the reason b_6 and
b_13 are not included in the cluster A_1 is that they have insufficient
references to position them properly in the network.


9.33 Bibliography 3 (IEEE Proc., v. 53, p. 1586)

In Fig.'s 9.21 to 9.23 the data for Bibliography 3 are presented.
The paper from which this bibliography is taken has four sections
(I, II, III, IV) with section III having four subsections (III A, B, C, D).
The particular section (and subsection) in which each bibliographic
item is first cited is noted in Fig. 9.21. These section numbers are
also noted over the symbols for the documents in Fig. 9.23. Some of
the documents in Fig. 9.23 are enclosed in parentheses. This is to
indicate that the document has already appeared elsewhere in the
diagram.

From Fig. 9.23 we note that a hierarchal series of clusters (A_1 to
A_4) similar to the one in Fig. 9.20 is formed by 13 of the documents
of Sec. III. A similar but separate series (A_6 to A_8) is formed by the
documents of Sec. IV. There also appears to be a separation of the
B


1-129-12 
1-129-18 
1-129-652 
1-131-111 
1-131-653 
1-131-1197 
1-131-2h20 
1-132 -1062 
1-132-1073 
1-132 -2039 
1-133-1487 
1-135-7L0 
1-135-1161 
1-136-1096 
1-137-211 
1-137 -889 
1-137-1h00 
1-138 -L,87 
21-29-357 
h1-11-316 
1-12-10) 
41-12 -166 
1-12 -360 
h1-13-162 
h9-7-112 
49-8-155 
9-8 -160 
l.9-12-297 
h9-13-287 
49-14-13 
h9-1h-73 
49-17-18) 
646-6-111 
669-17 -50 
669-18 -l,03 
669-20-552 


D 


1-130-929 


1-132-522 
1-132-535 
1-135-181 
1-137 -883 
1-10-1355 


9.21% 


IIID 
IV 

IV 

IV 

Iv 

ITIA 
IV 

ITID 
IIIe 
ITIC 
T1ic 
IIIC 
IV 

ITID 
IIIC 
TIIC 
ITIE 
ITic 
TIID 
ITIA 
IV 

Tiic 
Ific 
IIIA 
IIIc 
IIIc 
IV 

IIIA 
ITIc 
ITIA 


The sets of articles included in the clusters 


E 


1-129-1990 
1-131-2512 
1-133-1589 
1-134-507 
1-136-1170 
1-137-1717 
1-138 -88 
1-138-1453 
1-139-18h9 
41-12 -357 
310-7 -383 
669-17 -628 


F 


669-18-1125 
669-19-159 


a 
1-138-1191 
669-16-15h 
669-18 -h19 


ee ae: ers 
1-133-8) 
1-136-22 
41-11-552 


aces. eee 
h9-5-233 
h9-7-133 
80-20-1h2), 


for Bibliography 3. 


K 


1-131-73 
1-132-621 
1-134-1 
1-135-19 
1-136-306 
1-136-203 
1-136-893 
1-136-1471 
1-138-1661 
1-139-7h6 
1-10-1902 
1-1h1-52 
1-1h3-229 
41-15 -862 
669-16-9h5 
669-18-8 34 
6669-21-70) 


R 
669-18 -1260 


M 


1-129-1088 
1-130-92 
1-130-565 
1-131-617 
1-131-1995 
1-131-2078 
1-132-1512 
1-133-L43 
1-133-15L6 
1-135-1698 
1-137-1172 
1-137 -1706 
1-139-823 
1-139-1459 
1-10-2051 
1-10-2065 
1-141-52 
1-141-553 
1-143-h06 


M (Con't. ) 
80-18-1569 
669-16-14)81 
669-17 -87 
669-18 -52 
669-18 -896 
669-20-267 
669-20-560 
669-20-583 
669-2175 


N 


1-131-2h33 
1-131-2)63 
1-132-1991 
1-136-998 
1-137-431 
hi-12-558 
80-20-1136 


P 


1-133-1104 
1-139-1876 
1-143-h52 
49-13-282 


Q 


1-129-2055 
1-132 -1885 
1-14,0-187 
1-10-1429 
1-141-592 
9-7-7 
49-12 -297 
80-20-1374 
310-6-2565 
669~-16-818 
669-16-1459 
669-18 -908 


Answers to Selected Requests: 


ALY(b, )J=a, 1°1,2,20,23,36 

ALY(b,) )]=A, 

ALY(b3-)]=A, 

ALY(b,) J=A), 

ALY(b, )]=4, 1=15...17,22,2h, 
28,29, 32 

ALY(b, )]=A, 1=8...11,13,27 

ALY(b¢) =a, 

ALY(b, )]=Ag 1=18 ,19 

ALY(b, )]=A, inh, 3h 

ALY(b,)]=A, 4 

ALY(b35)]*A)) 


Definitions of Clusters: 
Ay =(byb2b) by 30 5).03604 601 gPQ MY 
pUz 

A.A, U(o7>.),) UF 

AsrA, Ulos5)UG 

AA, U (os) UH 

Age (by 2b 6b) 7>ygPoqPo1o2Po}, 
bog%aqh39) UDUTU(e,h, ) 


Age (bgPgby 013s 327 UU 


(hho eyes) 


Fig. 9.22. 
ALEC, A353) 7b 3835, )

ALY(b, )]= Misc. large sets of 
documents (88-159 articles) 
1=3,12,25,26, 31,33 

ALY(b, gb, , )J=A, 

ALY (baby obs P35) ILA, U As (db; cf, )] 

A(e59) 

ALY(byb, 9) ]=(cluster of 108) 


ALY( 6b, gh q>35) "Ay, 


A746 Ul>6) UL 

AgwA, J(04 919) 

Ag=(b) deb, 1.2 5)>5¢)UM 

Ayo74g Uo, URUle, ) 

Ayy7(>ybsb7>3q)UPU 
(agejegegh hom 5m) 7%) 

Ayo" A3 UAsU (m,297) 


List of answer clusters formed for Bibliography 3. 
[Diagram: the answer clusters of Bibliography 3, with each document
symbol labeled by the section (IIIA, IIIC, IV, etc.) in which it is
first cited.]

Fig. 9.23. Relationship of answer clusters of Bibliography 3.
documents by subsection within Sec. III. Note that 10 of the 13
documents cited in subsection IIIC are included in cluster A_9.

The structure of the clusters in this example was found to be
considerably more complex than in the previous two examples and no
attempt is made to predict the results of requests that have not been
explicitly tested. One can gain some appreciation of the complexity of
the interrelationships between the clusters by an examination of
clusters A_9 to A_12.

As with Bibliographies 1 and 2 there are a few of the documents
that are not included in the clusters of Fig. 9.23. Nine articles are
cited by Sec. IV. All of these except b_33 are included in the cluster
A_8. Thirteen articles are cited by Sec. IIIC. All but three of them
are in A_9 and all but b_3 are in A_10. The cluster A_10 is more
general in that it includes not only articles cited by Sec. IIIC but
also those cited by Sec.'s IIIA, D and E. Of the 27 articles cited by
Sec. III, 20 are included in A_12. The seven missing articles include b_3.
The article b_3 was examined in detail in an attempt to discover
why it was not included in A_12. It was found to have six references.
Of the six, one was keypunched incorrectly. Two of them are to articles
in a Russian journal (Soviet Physics - JETP), whereas the other
references to these articles in the T.I.P. file are to the journal in which
the English translation is found. A fourth reference is to a paper
written by the same author and not cited by anyone else, and a fifth is
to a bulletin, which was evidently not sufficient to cause it to be
included in the cluster. It was found that if the references had been
correctly keypunched and had been to the correct English translations,
b_3 would have been included in A_9 and probably A_10.

There is one other feature of the article from which Bibliography 3
was taken. In the final paragraph the author made this comment:

"I wish to thank ... A. R. Mackintosh for calling B. I.
Miller's work to my attention."

The article by B. I. Miller was checked to see if it would have
been included in any of the clusters if it had been part of the T.I.P.
file. It was found to have only one reference but this reference was
sufficient to cause it to be included in one of the clusters. Thus this
procedure could have performed the same reference service that
A. R. Mackintosh did.


9.4 Comparison to Categories

In the last section we compared clusters to the bibliographies
compiled by the authors of three articles. Another source of sets of
articles that have been judged to be related would be the subject index
found in one of the journals or in Physics Abstracts. For this purpose
one category was selected from the subject index of Physical Review and
one category was selected from Physics Abstracts.


9.41 Physical Review Category

Most of the categories in the Physical Review Subject Index are
very broad. The sets formed by clusters, on the other hand, are in
general much smaller and much more specific. Of course, larger clusters
could be formed by including a large number of articles in the Y set of
the request, but they would require a large amount of effort to process
and compare. For this reason a category with relatively few entries was
selected. Its title changed periodically over the three year period,
but it was identified as the one which was referred to when one looked
up the word "luminescence" in the word list which was supplied with
the subject index. The various titles used for the category are as
follows:

   1963   Luminescence (18 articles)
   1964   6.4 Luminescence and Fluorescence (6 articles)
   1965   2.3 Optical Emission and Absorption (17 articles)
   1966   4.3 Optical Emission and Absorption (2 articles)


The same format used for presenting the data in Sec. 9.3 is used
here in Figs. 9.24-9.26.

It will be seen from Fig. 9.26 that most of the papers separate
into the three major areas represented by A25, Ag, and A26. A
statistical analysis of the composition of each of these three clusters
is given in Fig. 9.27. It is found that the only words that appear more
than once in the titles of two or more of the clusters are optical,
absorption, radiation, and crystals. The correspondence of these words
to the title of the original category (optical absorption and emission)
is of interest.

A similar analysis of the author lists showed that N. Bloembergen
was the only author that appeared more than once in two or more of the
lists. The citation lists were also found to have very little overlap.
The greatest overlap occurred between Ag and A26. For example, the 1st,
3rd, 5th, and 7th entries in the list for Ag were found in the list for
A26 with a count of 2.

It is thus concluded that the articles in the clusters A25, Ag,
and A26 do have different characteristics. Whether the distinction
1-129-169


1-129-593 
1-129-2h22 
1-130-502 
1-130-639 
1-130-945 
1130-2257 
1-131-127 
1-131-501 
1-131-508 
1-131-1114 
1-131-1456 
1-131-15h3 
1-131-2036 
1-132 -22h 
1-132 -1023 
1-132 -1482 
1-132-2501 
1-133-1163 
1-136-1h1 
1-136-271 
1-136-508 
1-136-541 
1-136-1091 
1-137 -508 
1-137-536 
1-137-1117 
1-137-1651 
1-137-1787 
1-138-63 
1-138-180 
1+138-806 
1-138-1741 
1-139-321 
1-139-5h4 
1-139-1239 
1-139-1616 
1-1h0-155 
1-1h0-263 
1-140-601 
1-10-1867 
1-1)3-372 
1-143-57h 


Fig. 9.24.
1-134-1166
1-137-801 
1-138-1 
1-138-960 
3-82 -393 
3-85-565 
3-86-709 
h1-12-504 
41-13-33h 
41-+13-657 
h1-13-720 
49-10-52 
9-11-29 
64.6-6-25 


E 


1-139-10 
1-10-1051 
1-141-287 
1-141-306 
1-14-68 
199-138-753 
199-139-202 


F 


1-129-125 
1-132 -2023 
1-137-1515 
1-138-1:72 
1-138-1477 
1-139-1262 
1-139-1991 
1-10- 352 

1-1-6) 

49-19-89 


G 


1-139-588 


1-10-576 


H 


1-129-1980 
1-132-2h50 


J 


1-131-1912 


1-132-1029 
1-135-950 
1-135-1622 
1-137 -1087 
1-138 -1287 
1-139-31h 
11-34-1682 
11-35-1183 
12 -38-15hh 
12-38-1607 
12 -38 -2289 
12-39-3118 
12-2 -1999 
h9-18-219 
49-19-98 
80-18-1448 
80-19-1096 


eee eee 
1-133-1029 
1-136-81 

12-h2-3h0h 


R 


1-133-163 
1-133-1717 
1-13h4-299 
1-13-23 
1-135-1676 
1-137-583 
1-137-1016 
1-138-276 
1-139-1687 
1-139-1965 
1-14,0-880 
80-19-2260 
669-21-20 


80-19-92 


N 


1-140-957 
49-5-186 
612 -\-264, 


eens. oem 
1-139-970 


The sets of articles included in the clusters
for Category 1.


Fig. 9.25. Answers to selected requests for Category 1.
(Panels: Answers to Requests; Definitions of Clusters.)
Fig. 9.26. Relationship of answer clusters for Category 1.
CLUSTER CLUSTER A CLUSTER A, 6 
G0 articles (18 articles) (55 articles) 
109 words 84 words: 214 words: 
13° raman 7 sic 12 ruby 
9 stimulated 6 Execiton 11 optical 
6 laser 5 Complexes 9 lines 
6 radiation Absorption 8 KCL 
6 scattering lk Luminescence 8 spectra 
5 theory 3 cas 7 erystals 
h fluctuations 3 Effects 6 absorption 
\ intensity 3 Emission 6 thermoluminescence 
3 effects 3 Nitrogen 5 excited 
3 emission 3 Optical 5 F 
3 liquids 3 Radiation 5 MgO 
3 media 3 Recombination 4 center 
3 optical 2 Cadmium h crt 
3 order : 4 irfadiated 
3 waves 7 h R 
2 anti lh relaxation 
b 3 alkali 
37 authors: 25 authors: 85 authors: 
5 Shen Y. R. 6 Choycke W. J. Sturge M. D. 
h, Bloembergen N. 6 Hamilton D. R. 5 McCumber D. E. 
2 Armstrong J. A. 2 Patrick Lyle 3  Bloembergen N. 
2 London R. 2 Dean P. J. 3 Schawlow A. L. 
2 Smith Archibald W. 2 Reynolds D. C. 3 Yen W. M. 
2 Tang C. L. 1 Anders W. A. 2 Arten J. 0. 
1 Anderson H. G. : ; 
292 citations: 248 citations: 846 citations: 
12) 1-127-19 13 1-4-361 22 0-13-880 
10 1-130-2529 11 1-128-2135 151-122-381 
10 1-131-2766 11 41-1-h50 15 12-36-2757 
10 1-133-37 10 1-127-1868 Vy 11-34-1682 
10 41-9-55 8 1-131-127 131-122-1469 
10 41-11-160 7 1-116-473 10 1-130-639 
10 h9-7-186 6 1-133-1163 10 12-20-1752 
9 646-3-181 5 1-120-166) 9 80-13-899 
8 41-11-19 5 1-127-1878 8 1-57-26 
8 1-12-50 5 1-132-2023 8 30-31-956 
7 1-13-1429 h (5 citations) 7 (3 citations) 
7 6h6-3-137 3 (7 citations) 6 (12 citations) 
6 h1-12-290 2 i citations) 5 (8 citations) 
5 (5 citations) 1 (18 citations) (18 citations) 
(11 citations) ; 3 (33 citations) 
3 a citations) : 2 (121 citations) 
2 (3h citations) 1 (71 citations ) 
1 (212 citations) 


Fig. 9.27. Comparison of the three clusters formed for Category 1.
between the clusters is of practical significance to a user would, of
course, require further experimental justification.

As an additional comparison the results of this section were
compared with the articles found in the category in Physics Abstracts
with the title "luminescence." This category contained 22 of the
articles listed in Fig. 9.24 (14 in set B and 8 others). All of these 22
articles were included in Ag or A26. This would tend to indicate that
the Physics Abstracts indexers also considered the articles of A25 to be
in a different area than those of Ag and A26.


9.42 Physics Abstracts Category

Since a property (luminescence) was chosen for the last section,
it was decided that a category covering a substance might be appropriate
for this test. We again sought a category with relatively few entries
so that it would be easier to compare it with the related clusters.
The category with the heading "Erbium" was selected. The articles
classified in this category from January 1963 to the present are listed
in set B of Fig. 9.28. Figs. 9.29 and 9.30 present the related
clusters.


9.5 User Experience

In the last two sections we compared the results of the clustering
procedure to the three bibliographies and two categories. In this
section we will present the response of the system to some actual
requests for information. The responses to both a relatively simple
request and a more complex request are studied.
1-131-1043
1-131-1586 
1-132 -1609 
1-137 -138 
1-137-1109 
11-35-10h7 
11-36-1001 
11-36-1127 
11-36-1249 
12-38-2190 
12-39-1285 
12-39-1629 
12-39-2128 
12-40-2751 
12-10-3606 
12-41-1225 
12-41-3363 
12-2 -873 
12-13-87 
29-29-77 
9-8-5 
h9-11-100 
4g-13-112 
49-15-301 
9-16-265 
49-17-95 
80-20-808 
80-20-1332 
199-137-790 
310-6-2225 


D 
1-129-2072 
1-130-1337 
1-130-1825 
1-131-932 
1-131-1039 
1-138-216 
1-139-1606 
1-10-1896 
3-81-86 
3-84-63 
3-8) -693 
11-36-906 
11-36-1078 
11-36-3628 
12 -39-14h9 
29-31-1 
hg-6-19 


Fig. 9.28. 
1-131-158


1-13-1620 
1-137-1139 
1-138-2h1 
3-85-955 
11-36-1209 
49-17-96 


F 


1-132-5h2 


1-133-219 
1-134-9h 
11-35-800 
12-43-2087 


G 


1-129-1601 


1-130-1100 
1-133-1571 
1-134-320 
1-13-1192 
1-136-175 
1-136-231 
1-136-271 
1-136-711 
1-136-717 
1-136-726 
1-137 -627 
1-137-1Lh9 
1-10-1968 
1-141-352 
1-141-61 
3-81-663 
12 -39-1h22 
12-39-1455 
12-39-3503 
12-2 -377 
12-2-981 
12-42 -1h23 
21-29-98 
21-31-85 
21-31-1325 
49-10-16 
49-10-96 
310-7 -1150 


H 


1-139-2h1 


3-82-87), 

12-38-2750 
12-h2-l000 
12-43-1680 
80-18-1636 


J 


1-130-2325 


1-132 -280 
1-133-881 
1-136-1)33 
1-10-2005 
1-142-115 
12-41 -565 
12-1-617 
4l-11-196 


K 


1-1h1-h 


43-36-505 
1-137 -1886 
1-139-2008 
3-8h-297 
12-38-976 
12-38-2171 
12-39-3251 
12-l0-796 
12-0- 328 
12-2 -162 
12 -42-993 
12-2 -3797 
12-43-212h 
hl-11-253 


M 


1-130-9L5 


1-130-1370 
1-133-3h 
1-133-L9oh 
1-134-172 
1-134-150) 
1-137-179 
1-138-1682 
1-1h1-259 


M (Con't.) 
12-39-102) 
12-39-115h 
12-0-7h3 

12-l1-892 

12-42-7443 

164,-39-32 
310-7 -1450 
a 
1-138-15hh, 
12-38-1476 
12-38-2190 
12-39-213 
12-41-1305 
12-41-3227 
12-43-1702 


P 

1-133-1364 

49-19-63 

eee. eee 

12-11-1970 
R 


11-36-2h22 
80-20-997 


oe 
1-133-136) 
i 


21-29-97) 


9-20-96 
is ov 
669-17-1118 
669-18 -1022 
ees Wes 
1-135-97 
bag We 
1-10-1188 
1-1h1-251 
x 


11-36-98), 
12-)1-892 


The sets of articles included in the clusters 
for Category 2. 
Fig. 9.29. Answers to selected requests for Category 2.
(Panels: Answers to Requests; Definitions of Clusters.)
Fig. 9.30. Relationship of answer clusters for Category 2.
9.51 Simple Request

This test was performed in cooperation with a research physicist
from Lincoln Laboratory. His initial request consisted of the following
relatively brief specification:

   words:     turbulence
              subsonic    )
              hypersonic  ) perhaps
              wake        )

   authors:   Lees
              Hromas

   articles:  none

No articles were found which were written by the two authors
(actually there were three papers by a Lees but in a completely
different area). There were 70 articles that had either "turbulence"
or "turbulent" in their titles (set T of Fig. 9.31). There were 27
which contained one or more of the words "wake", "subsonic", or
"hypersonic" (set W of Fig. 9.31).

At this point a number of the articles in set T were used as
requests to the clustering procedure. The cluster structure shown in
Figs. 9.32 and 9.33 resulted. The physicist was asked to evaluate the
pertinence of each of the articles presented. He gave three types of
responses: pertinent (y), non-pertinent (n), and questionable
pertinence (m). The responses are indicated in Fig. 9.31 and also in
Fig. 9.32 by the superscripts. It will be noted that nine of the twelve
articles specified as pertinent are in the A, cluster.

The physicist was asked if there was any detectable difference
between the articles in the A3 and Ay clusters, which were kept disjoint
by the procedure. Of the 16 articles in Ap, 15 were from Russian
journals, while 27 of the 35 articles in A, were from American journals.
It was


T 


11-36-2075 
11-36-2201 
21-31-141 
29-30-17 
1-1) -813 
h1-1),-892 
h1-15-381 
49-9-14h, 
9-12 -201 
49-13-297 
9-18-22) 
80-19-1430 
384-32 -292 
646-7 -285 
669-16-295 
669-16-1578 
669-17 -403 
669-17 -1hh9 
669-18 -8h:7 
669-18-1251 
669-18 -1268 
669-19-31)9 
669-20-4)5 
669-20-1519 
669-21-7hh 
669-21-77h 
669+21-1161 
790-6-882 
790-6-1017 
790-7 -3h4h, 
790-8-5h 
790-9-1057 
790-9-1h29 
790-10-191 
790-10-10h1 


Fig. 9.31. 


SRBRBOO DB BUD EBM BPP BBB EB RBS SMH BBP BP BH BP Pee eo 


T (Con't. ) 


799-6-1016 
799-6-108 
799-6-1250 
799-6-1260 
799-6~-1693 
799-7 -190 

799-7 -335 

799-7 -562 

199-7 -629 

799-7 -816 

799-7 -1030 
799-7 -10),8 
799-7 -1156 
799-7-1160 
799-7-1163 
799-7 -1169 
799-7 1178 
799-7 -1191 
799-7 -1,03 
799-7 -1723 
799-7 -1735 
799-7 -1920 
799-8 -391 

799-8 =1,92 

799-8 -575 

799-8 -598 

799-8 -1063 
799-8 -1509 
799-8 -16h7 
799-8 -1659 
799-8-1775 
799-8 -1792 
799-8-2219 
799-8 -2225 
821-2 -332 


Sets of articles included in the 


DOS“ BRPBHBPBMH BU PSS SSB BEB BS BBB ABBE BE SSBB 


W 


1-134-581 
1-135-1761 
1-138-93), 
3-82 -669 
11-36-3) 
41-10-127 
41-13-h37 
h1-12-592 
1-13-72 
1-15-36 
49-19-59 
80-18-288 
80-18-1515 
646-.-28 
646-7 -187 
799-6-9h6 
799-6-1388 
799-7 -197 
799-7 -667 
799-7-114,7 
799-7 -1198 
799-8-kh 
799-8-211 
799-8 -956 
799-8 -1h28 
799-8 -1456 
799-8-1792 


clusters for Physicist 1. 


(y=pertinent, n=non-pertinent, 


m=questionable pertinence ) 


D 


11-36-3609 y 


17 -32-298 


n 


669-18-698 n 
669~-18-1014 n 
669-19-99 n 
669-19-1165 n 
669-20-135 n 
790-10-605 n 
799-6-1603 n 
Fig. 9.32. Answers to selected requests for Physicist 1.
Fig. 9.33. Relationship of answer clusters for Physicist 1.
(y=pertinent, n=non-pertinent, m=questionable pertinence)

initially thought that the separation of the two clusters
was probably due to the fact that the Russians generally cited Russians
while the Americans cited Americans. After examining the two sets, the
physicist expressed the opinion, however, that Ay appeared to be more
concerned with the upper atmosphere and ionosphere.

Also supporting the contention that there is a valid and useful
distinction between A, and AD is the fact that nine of the eleven
articles judged to be pertinent were from the Ay cluster.

Because of the incompletely inverted files and the delays caused
thereby, the actual searches were performed by the author of this
thesis and later discussed with the physicist. It was interesting to
note that at one point in the discussion he stated that he could have
shaped the final cluster more correctly had he been able to specify as
non-pertinent some articles on turbulence in helium that appeared in
one of the clusters.

We note in passing that the physicist who aided in this test is
the author of article te7.


9.52 Expand Extensive Bibliography

In this section an example is given of how the clustering procedure
might be used to supplement or extend an already sizable collection of
papers on a given subject.

A bibliography of 112 articles on Langmuir probes was supplied to
the author by another research physicist at Lincoln Laboratory. Of the
112 articles, 89 are to journals, 5 are to the 25 journals covered by
the T.I.P. file, and 21 are actually in the T.I.P. file. The
identifications of the 21 articles in the T.I.P. file are given in
Fig. 9.34. Fig. 9.35 shows the distribution of the articles in the file
with time. Fig. 9.36 lists the words occurring in five or more of the
112 titles. In this list words such as "of", "the", "theory", etc., have
been omitted. Also, words have been grouped by stem. Thus, the words
"ion", "ions", "ionized", etc., are all grouped under the word "ion".


Set B B (Con't.) B (Con't.) B (Con't.) 
3-82-2113 11-36-1866 911-126 799-6-1)92 
11-34-1165 11-36-2363 80-18 -260 7992-1433 
11-3)-3209 21-30-182 80-18-1908 799-7 -18):3 
11-35-1130 21-30-193 690-8-720 799-8 -56 
11-36-337 21-30-375 799-6-1h79 199-8-73 
11-36-675 


Fig. 9.34. The 21 articles of the Langmuir probe bibliography that
are in the T.I.P. file.


Fig. 9.35. Publication year distribution of initial
Langmuir probe bibliography. (Vertical axis: number of articles;
horizontal axis: year.)
Words                    Number of articles
probe                           87
plasma                          40
Langmuir                        35
ion                             18
gas                             15
discharge                       13
electron                        12
collection                      10
density                          8
low                              7
pressure                         6
spherical                        6
electrostatic                    6
probe and plasma                32
probe and Langmuir              35
probe and ion                   16
probe and gas                    7
probe and discharge              6


Fig. 9.36. Title word distribution for the 112 titles of
the initial Langmuir probe bibliography.


As an additional part of this test it was decided that five other
types of search strategies would also be used and their results would
be compared to the results of clustering. The five search strategies
selected will now be described.

TITLE WORD SEARCH

One possible search strategy would be to retrieve all those
articles which have some word or logical combination of words in their
titles. The choice of the word or words to be used was made on the
basis of the frequency of occurrence of the words in the bibliography
(Fig. 9.36) and in the T.I.P. file and with the advice of the physicist.
Several test runs were made with various word combinations. A simple
request for all articles with the word "probe" in their titles was
selected. This retrieved 58 articles including 20 members of the
original bibliography.
AUTHOR SEARCH

There are 114 different authors of the 112 articles in the
bibliography. A search of the T.I.P. file for articles by these 114
authors yielded 120 articles (21 from the original bibliography and 99
other papers). This search was not exhaustive but involved looking for
authors only in those journals where it was thought they might publish.

CITATION SEARCH

The third type of search consisted of finding all of the articles
that cite one or more of the 112 articles in the bibliography. A
search of the T.I.P. file using this criterion yielded 78 articles.

BIBLIOGRAPHIC COUPLING SEARCH

When two papers cite one or more of the same papers they are said
to be bibliographically coupled (Sec. 6.22). There are 270 articles
that are bibliographically coupled to one or more of the 21 articles
in set B of Fig. 9.34.

The coupling strength between two papers is defined to be the
number of identical citations that they have. The coupling strength
between one paper and a set of papers is defined to be the number of
citations in the single paper which are also found in one or more of
the papers in the set. In Fig. 9.37 we show the distribution of the
270 articles by their coupling strength to the set B.
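
The two definitions of coupling strength can be illustrated with a
minimal sketch. The function names and the citation identifiers below
are hypothetical and chosen only for illustration; the sketch assumes
that each paper is represented by the list of papers it cites.

```python
def coupling_strength(refs_a, refs_b):
    """Number of identical citations shared by two papers."""
    return len(set(refs_a) & set(refs_b))


def coupling_strength_to_set(refs, set_of_papers):
    """Number of citations in one paper that are also found in one or
    more of the papers of a set (each given as a list of citations)."""
    cited_by_set = set().union(*set_of_papers)
    return len(set(refs) & cited_by_set)


# Hypothetical citation lists, not taken from the T.I.P. file.
paper_1 = ["x", "y", "z"]
paper_2 = ["y", "z", "w"]
paper_3 = ["q"]

pair_strength = coupling_strength(paper_1, paper_2)
set_strength = coupling_strength_to_set(["x", "y", "q"], [paper_2, paper_3])
```

Here paper_1 and paper_2 share two citations, and the single paper
shares two of its three citations with the set formed by paper_2 and
paper_3.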

JOINTLY CITED SEARCH

Bibliographic coupling occurs between two papers if they cite
one or more of the same papers. Another type of coupling occurs if
two papers are cited by one or more of the same papers. There are
605 papers which occur in one or more bibliographies with articles of
set B. Of the 605, 101 are in the T.I.P. file.
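
A search of this kind might be sketched as follows. This is only an
illustration under the assumption that the file is available as a
mapping from each citing paper to its bibliography; the name
jointly_cited and the sample identifiers are hypothetical.

```python
def jointly_cited(bibliographies, target_set):
    """Return the papers that occur in one or more bibliographies
    together with at least one member of target_set."""
    targets = set(target_set)
    found = set()
    for refs in bibliographies.values():
        refs = set(refs)
        if refs & targets:           # this bibliography cites a target
            found |= refs - targets  # keep the other papers it cites
    return found


# Hypothetical data: three citing papers and a two-member target set.
bibliographies = {
    "c1": ["b1", "p1", "p2"],
    "c2": ["p3", "p4"],
    "c3": ["b2", "p2"],
}
result = jointly_cited(bibliographies, ["b1", "b2"])
```

In this example p1 and p2 are retrieved because they appear in
bibliographies that also contain b1 or b2, while p3 and p4 are not.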

Fig. 9.37. Distribution of articles with various bibliographic
coupling strengths. (Vertical axis: number of articles, marked at 10,
100, and 1000; horizontal axis: coupling strength, 1 to 11.)

CLUSTERING

The user specified the article bi? as the article of greatest
interest in the bibliography. The articles Des bes big» and Pig were
ranked next in terms of interest. The clusters which resulted when
these and various other articles were used as requests to the system
are shown in Figs. 9.38-9.40.


D E (Con't.) G J 
11-34-1897 1-11-310 3-83-1173 11-35-1365 
55-h1-132 h1-15-286 11-35-130 790-10-1102 
80-19-1915 64.6-h-186 55-h1-391 799-6~-1762 
612-2-719 F 55-h1-1h05 799-7 -1834 
139-8-118 S-B1-08F— ioe K 

11-36-3h2 H 80-18-26 

E eee T9-7-110- 80-18-1056 
TITLE 11-36- cols -20- 
Pnerire is 3-8 ieee 1258 
11-36-3142 790-7 -788 799-8-2097 ' 
11-37-180 Ti-37-377 

Fig. 9.38. The sets of articles included in the clusters
for Langmuir Probe Bibliography (Physicist 2).
Fig. 9.39. Answers to selected requests for Langmuir Probe
Bibliography (Physicist 2).
(Panels: Answers to Requests; Definitions of Clusters.)
Fig. 9.40. Relationship of Clusters for Langmuir Probe
Bibliography (Physicist 2).


COMPARISON

The six preceding search strategies produced a total of about 500
different articles. It was decided that this constituted too large a
file to ask the user to evaluate. The file was, therefore, reduced to
the 104 articles which appeared to have the greatest chance of being of
interest to the user. These included the 83 articles which were
retrieved by two or more of the six search strategies, the 15
additional articles which were bibliographically coupled to the set B
with a value of three or more, and another six articles which contained
the word "probe" in their titles in the sense of a measuring device. In
seven other articles the word "probe" was found in the title but it was
used as a synonym for investigation (e.g. "three-field model as a probe
of higher group symmetries").

The 104 articles presented for evaluation are listed in Fig. 9.41.
The first column (A) is the identification. The next column (B)
contains an indication (1) of those articles which are members of set B.
The next six columns (C-H) note which articles were retrieved by each
of the six search strategies:

   C - Column contains a one if the paper has the word "probe" in
       its title.
   D - Number of authors of the paper that are also authors of the 112
       papers in the Bibliography.
   E - Number of the 112 papers in the Bibliography that are cited by
       the paper.
   F - Bibliographic coupling strength of the paper to the set B.
   G - Number of papers which cite the paper and also cite one or
       more of the 112 papers in the Bibliography.
   H - Symbol of the paper in the clusters of Figs. 9.38 to 9.40.

(Note that the counts in Columns D and F do not include the authors
or citations which match only because the article itself is in the
set B.)

The last column (J) contains the evaluation code. Each document was
assigned to one of the following five categories:

   1 - Of personal interest to user.
   2 - Of general interest.
   3 - Perhaps of general interest.
       (e.g. a probe may have been used as a tool in the experiment.)
A BCDEFGH J A BCDEFGH J
T-129-1161 ----72- 3 LI-15-1018 ae 
1-132 -1435 -~-1-31- 3 4g-h-135 -- 
1-132-1hh5 Seo edocs - 3 hg-5-2hh -- 
1-132 -2363 --ll--- 3 4g-11-126 ll 
Meare -~--e 7 -- 3 eas rad aa 
1-13-1221 ----6--+ 3 9-20-7 -1 
1-137-3L6 -~----h-+ 3 4.9-20-269 -1 
1-138-1015 --l-1-- 3 55-1-132 ~- 
Neer ---- ; -- 3 artic ee 
1-10-77 ----h-- 3 -l)1-140 -< 
1-1)1-1h6 ~ee eh -- 5 55-1-1980 aH 
3-81-682 3 Sar Oe £3 80-18 -260 L131--»b 
Mog ee L1-323>d) : ponies means Ber : 1 Ky 
3-83-73 ---l2lé 0-18- - se - -- 
3-83-971 iS eh eon er 2 80-16-1056 --1-2-k 
3-8h-133 ----h-s 3 80-19-5 --1-1-- 
11-3)-1665 L1l--11»d, 1 80-19-1908 112236 Oh 
ees aes ee ein, 2 ae 
~ 34-261 ele Seek -19- “eS eae 
ened l1i-il : 1 3 1 Spee oer =< oe be l= k 
11-35-130 ee er oes Ws ae 1 16-37-21 i eee 
11-35-1130 p Ree Caras as ge vp 1 612-2-58 --1l-1-% 5 
Ty ae i a Ee ee 
3632 —~C(“‘ Ml 2 5 612-3-24 -l----- ) 
i a ee 
11-36-675 a7. a 646-4. -186 -1 
11-36-1659 ay 5 646-7 -324 -l 
11-36-1866 112-2- by 1 669-16-887 -- 
11-36-2361 -12-2-f, 1 790-6-9h)7 -- 
11-36-2363 11--8- bg i 790-6-990 -1 
11-36-2672 ~-l-+-l-se- 1 790-7 -580 ---121 
11-36-3135 -+1-9- e) 2 790-7 -788 ~-l--3- 
eee -l1ilhl °3 2 Ud a - ; -1l ' 
11-36-352 ----T7- 3 -8-319 -1l--- 
11-36-37L0 --l-l1- a 3 790-8 -720 l1l---1 
ee -11 ; ; -e, 1 Lee ---1l . - 
11-37-21 ell -= 2 -10- -+--3- 
11-37 -377 --212-m 3 1799-6-1h79 112h46 
11-37-19 --1-2-- 3 799-6-1)92 111337 
17-27-674 eo ae ee | 199-6-1762 -- 7-225; 
21-29-93 --l-l-- 3 799-7-110 ---h21h 
21-29-1165 -ll---- 1 799-7 -1329 2 ee 2 (eg 
21-29-1313 -ll---- 1 799-7 -1433 L1l---- dp 
21-30-182 113203 bg 1 799-7 -1517 -\-lll-- 
oe, Geese: | Ge. 
21-30-2021 ag ate ee 799-8-56 L11hs1 2 
21-31-1632 ~-l-l-- 4& 799-8-73 Ll1134- ¥ 
1-11-310 --1-12 er 2 799-8-7L8 -lll1-«& 
1-13-83 oe ee 5 799-8-920 --llliln 
1-15-286 sere 2-e 3 799-8-2097 --122- hi, 


Fig. 9.41. Langmuir Probe papers evaluated by physicist. 
Explanations of columns are given in text. 


   4 - Degree of interest cannot be determined by examination of the
       author(s).
   5 - Not of interest.

In Fig. 9.42 the results of each of the six search strategies are
tabulated for comparison. The results for bibliographic coupling are
separated into two entries depending on the coupling strength.

An examination of Fig. 9.42 indicates that the search strategies
using the author, citation, and cited-by-same criteria yield
comparatively large sets of documents containing relatively few of the
articles judged to be of specific pertinence by the user (evaluation
category 1).

Bibliographic coupling with the coupling strength greater than or
equal to one yields such a large set of articles (270) that it would be
more appropriate to compare it with a larger cluster such as the
85-article cluster which contained 26 of the category-1 documents. Let
us therefore compare cluster A13 with the set of articles with coupling
strength greater than or equal to two. It will be seen that A13 is less
than half as large and yet contains three more of the category-1
documents.

It will be observed that the clustering procedure uses the same
data used in bibliographic coupling but in a different way. Consider,
for example, the 27 articles in A13 which are not part of the original
bibliography. Seven have a coupling strength to B of only 1 and six
have a coupling strength of 2, whereas an article like 1-129-1181
with a coupling strength of 7 is not included in A13.
                          Number of       Number of articles in
                          articles        each evaluation category
Search Strategy           retrieved       1    2    3    4    5



Title word 58 30 ll ol 2 = 66 
Author 120 18 #10 #15 2 = «8 
Citation 78 146 687 8 oO 5 
Bibliographic coupling 88 19 10 19 0 9 
(strength 2 

Bibliographic coupling 270 26 12 2 #2 «15 
(strength 1) 

Cited-by-same articles 101 13 8 h oOo 7 
Clustering (A,,) 43 22 8 7 oO 6 
Total abt. 500 31 16 32 & ail 


Fig. 9.42. Comparison of results of seven search strategies. 


Let us now turn our attention to the title word search. Fig. 9.42
indicates that this search strategy retrieved four more of the
category-1 documents than were retrieved by the search strategies based
on citations (i.e. bibliographic coupling and the 85-document cluster).
This result provides an example of a case where title words provide a
better basis for retrieval than do citations. Previous experience
would indicate that such is not generally the case.

To determine why the clustering procedure was less effective in
this case, the five category-1 documents which did not appear in any of
the clusters generated were examined. It was found that three of them
(b13, b18, and 21-29-1165) contain only a single citation and the other
two (b15 and 21-29-1313) contain only two citations. We are thus led
to the same conclusion arrived at earlier, that the clustering system,
in general, has trouble properly placing documents with three or fewer
citations.

The remedy for this difficulty would be to use some additional
types of partitioning data. In the example at hand, all 31 of the
category-1 documents could be retrieved in the same cluster if the
system used not only the partitions generated by citations but also
those generated by certain keywords like "probe".

One other observation may be worth noting. The article b10 was
part of the original bibliography but was not included in any clusters
with other members of the bibliography. A check of its bibliography
showed that it had nine citations, which experience indicated should be
enough to place it in the correct cluster. The author of this thesis
decided, therefore, to ask the physicist if b10 was in a different area
from the other 20 members of the bibliography. Before this was asked,
however, the evaluation of the 104 articles of Fig. 9.41 was made. A
check of this evaluation revealed that 19 of the 21 members of the
original bibliography were placed in evaluation category 1 while b10
was placed in category 3.


9-6 Summary of Results 


For purposes of comparison and emphasis let us summarize some of
the significant features of the last three sections. In Fig. 9.43 two
measures of the success of the clustering procedure are tabulated.
Column four indicates how many of the pertinent articles were retrieved
by the clustering system in each test. Column five indicates what
fraction of the articles retrieved were pertinent. The particular
cluster selected for each test is specified in parentheses in column
three.
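
The two measures just described can be sketched in a few lines of
Python. This is only an illustration: the function name
`cluster_scores` and the integer document labels are assumptions made
here, not part of the experimental system. The sizes used (10 pertinent
papers, a cluster of 17, 9 papers in common) are those reported for
Bibliography 1.

```python
def cluster_scores(pertinent, cluster):
    """Return (percent of pertinent papers found in the cluster,
    percent of the cluster that was specified as pertinent)."""
    pertinent, cluster = set(pertinent), set(cluster)
    hits = len(pertinent & cluster)
    pct_pertinent_retrieved = 100.0 * hits / len(pertinent)
    pct_cluster_pertinent = 100.0 * hits / len(cluster)
    return pct_pertinent_retrieved, pct_cluster_pertinent


# Hypothetical labels: papers 0-9 are pertinent; the cluster holds
# nine of them (0-8) plus eight other papers (100-107).
pertinent = range(10)
cluster = list(range(9)) + list(range(100, 108))
col4, col5 = cluster_scores(pertinent, cluster)
print(round(col4), round(col5))
```

The first value corresponds to column four of the summary and the
second to column five.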

                        Number of                        Percent of     Percent of
                        papers          Size of          pertinent      cluster
                        specified       related          papers in      specified as
Name of Test            as pertinent    cluster          cluster        pertinent

Bibliography 1          10              17 (A,)          9/10 = 90%     9/17 = 53%
  (Sec. 9.31)
Bibliography 2          16              64 (A)           14/16 = 88%    14/64 = 22%
  (Sec. 9.32)
Bibliography 3(III)     27              48 (A, 5)        20/27 = 74%    20/48 = 42%
  (Sec. 9.33)
Bibliography 3(IV)      9               31 (Ag)          8/9 = 89%      8/31 = 26%
  (Sec. 9.33)
Bibliography 3(IIIC)    13              22 (A,)          10/13 = 77%    10/22 = 46%
  (Sec. 9.33)
Category 1              43              105 (AQUA, UA)   28/43 = 65%    28/105 = 27%
  (Sec. 9.41)
Category 2              30              133 (4, UA.)     19/30 = 63%    19/133 = 14%
  (Sec. 9.42)
User 1                  12 (y)          59 (A,9)         9/12 = 75%     9/59 = 15%
  (Sec. 9.51)
User 2                  31 (1)          43 (A, 3)        22/31 = 71%    22/43 = 52%
  (Sec. 9.52)


Fig. 9.43. Summary of the experimental results of Sections 9.3-9.5.


One additional statistic may be of interest. This relates to
whether the documents that are pertinent to a search are added to the
cluster early or late in the process. For this purpose 50 clusters
from Sec. 9.33 and 9.41 were analyzed and the number of articles of
specified pertinence added in each quarter of the process was noted.
These figures were averaged for the 50 clusters. The results are
shown in Fig. 9.44. It will be seen that on the average almost half
(45%) of the pertinent articles which are included in the final
cluster are added during the first quarter of the process.


[Graph. Vertical axis: average percent of bibliography added per
quartile. Horizontal axis: quartile of clustering process (0-1/4,
1/4-1/2, 1/2-3/4, 3/4-1).]

Fig. 9.44. Graph showing average percent of bibliography
(or category) articles added during each
quartile of the clustering process.
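
The quartile tabulation described above might be computed along the
following lines. The sketch assumes that a quarter of the process means
a quarter of the positions in the order in which documents joined the
final cluster; the text does not say whether quarters were measured
this way or by iterations of the procedure. The function names and the
sample data are hypothetical.

```python
def percent_added_per_quartile(order_added, pertinent):
    """order_added: documents in the order they joined the final cluster.
    Returns, for each quarter of that order, the percent of the pertinent
    documents present in the final cluster that were added in it."""
    pertinent = set(pertinent)
    n = len(order_added)
    counts = [0, 0, 0, 0]
    for position, doc in enumerate(order_added):
        if doc in pertinent:
            quartile = min(3, (4 * position) // n)
            counts[quartile] += 1
    total = sum(counts)
    return [100.0 * c / total if total else 0.0 for c in counts]


def average_over_clusters(runs):
    """runs: list of (order_added, pertinent) pairs, one per cluster."""
    per_run = [percent_added_per_quartile(o, p) for o, p in runs]
    return [sum(column) / len(per_run) for column in zip(*per_run)]


# Two made-up clusters standing in for the 50 that were analyzed.
runs = [
    (list("abcdefgh"), {"a", "b", "e", "h"}),
    (list("pqrs"), {"p", "s"}),
]
averages = average_over_clusters(runs)
print(averages)
```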
CHAPTER X 
CONCLUSIONS 


In this chapter we shall make some initial comments concerning the
adequacy of the various components of the experimental system. Then
certain conclusions about the clustering procedure will be given. Next
the effectiveness of the overall model and system in retrieving useful
sets of documents will be evaluated. In the final section some possible
avenues for further research will be suggested.


10.11 MAC Time-Sharing System 
After five years' experience with batch processing computers, the
author of this thesis found the MAC time-sharing system a refreshing
change with some significant advantages. Let us briefly comment on the
use of the MAC system in three areas: in debugging programs, in testing
and evaluating systems, and in operational retrieval functions.
DEBUGGING 

It is estimated that the use of the MAC system cut by a factor of 
somewhere between two and ten the amount of time required to debug the 
experimental program. This, of course, is due to the fact that turn- 
around time for a run with time-sharing is of the order of a few 
minutes, whereas with batch processing it is usually several hours or 
days. 

The availability of more sophisticated debugging routines would
have reduced debugging time even further. Some features that would
have been of special help are multiple break points, conditional break
points, an interpretive mode, more convenient patching, automatic
updating of the English text, etc.

One problem in using time-sharing for debugging is that it is 
almost too easy to make changes to a program and re-run it. This 
results in one making a change before its consequences have been fully 
considered. Part of the answer to this problem lies in self discipline 
on the part of the programmer. It will also help when a computer
becomes available on a 24-hour basis so one is not tempted to try to
rush through a change before a maintenance or test session.

Two minor improvements to the consoles would help. A less noisy 
console would allow the user to more effectively contemplate a problem 
at the same time the computer is printing out some results on the con- 
sole. Also a neon light showing when the console is being serviced by 
the central processor would be of considerable value. 

SYSTEM TESTING 

After one has obtained a program that is debugged and performs 
according to specification, it often becomes apparent that the original 
specifications for the program need changing. This may result in some 
modifications to the program, or if the change is extensive, it may 
require rewriting the whole program. The same advantages and problems 
that time-sharing has in debugging are also in evidence in this cycle 
of program specification and respecification. 

OPERATIONAL RETRIEVAL 

Let us now consider what would happen if one were to decide to use
the MAC system or one like it as an operational information retrieval
system serving a community of real users.

If all of the IBM 1302 disc were used for data, a file 30 times the
size of the current T.I.P. file could be stored. This would allow one
to increase the time span covered by the periodical literature from 3
to perhaps 10-15 years and also add some non-periodical literature.
All of the files could also be completely inverted. There would
probably still be room left for coverage of another discipline about
the size of physics. If magnetic tapes were used, coverage could be
increased even further by loading the disc with different data on
different days of the week.

Let us assume that the current limit of 30 users on line at once 
is maintained. The response time for simple requests for information 
would probably be acceptable to most users. This would be 1 second of 
computer time and 1-30 seconds of real time. The response time to 
more complex requests would probably be found objectionable to some 
users. Retrieval of a cluster, for example, might take 40-50 seconds 
of computer time and 5-10 minutes of real time. 

The response time to complex requests could be improved by a 
factor of 5-10 if the supervisory system were modified to allow some 
type of direct access to the disc. The current supervisory program is 
designed for the storage of files that are constantly changing. This 
places a penalty factor of 5-10 on the accessing of files that never 
change, such as those found in a library. 
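
As a rough worked example of what the suggested improvement would mean
for cluster retrieval, the fragment below divides the quoted 5-10
minutes of real time by the quoted factor of 5-10. It assumes, purely
for illustration, that the whole real-time delay scales with the
file-access penalty; the function name is hypothetical.

```python
def improved_range(current_range, factor_range):
    """Best and worst case after dividing a (low, high) time range
    by a (low, high) improvement factor."""
    low, high = current_range
    f_low, f_high = factor_range
    # Best case: shortest time with largest factor; worst case: the reverse.
    return low / f_high, high / f_low


real_minutes = improved_range((5, 10), (5, 10))
print(real_minutes)  # (0.5, 2.0)
```

Under that assumption a cluster request would take roughly half a
minute to two minutes of real time.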

One of the biggest difficulties with using the MAC system as an 
information retrieval service is that it has no provision for the trans- 
mission, display and reproduction of analog information. Such a 
capability would probably be needed, for example, if the system were to
supply the abstracts or total text of articles.

Thus, with the current system a person with a console in his
office might be able to identify which articles are of interest, but
he would still have to go to the library to get them. (He could perhaps
have his own microfilm system, but this would be very expensive.)


10.12 T.I.P. Document Collection 

The first tests of the clustering procedure were performed using 
a single volume of the Physical Review. As the data base was increased, 
some marked changes in the characteristics of the procedure were noted. 
One of the major causes of these changes was the fact that the parti- 
tioning sets for the single volume are all quite small, whereas the 
partitions for the total T.I.P. file have a wide range of sizes. 

The question arises as to whether an increase of perhaps one or 
two orders of magnitude in the current document file might further 
change the way the procedure operates. In an attempt to answer this 
question, let us first note that such an increase would necessarily 
involve coverage of some additional branches of science such as 
chemistry, mathematics and/or electrical engineering. This would be 
true since a sizeable fraction of the significant physics periodical 
literature that is being published is already being added to the T.I.P. 
file. This implies that the size of the clusters generated by the 
procedure would not significantly change even if the size of the 
collection were greatly increased. 

Also the use of an inverted data storage system would keep the
access time to any one piece of information relatively constant even
when the size of the file is measurably increased. It is, therefore,
concluded that the system would operate in essentially the same way it
currently does even if the document file were scaled up in size by
several orders of magnitude.


10.13 Partitions 

The experimental results as summarized in Fig. 9.43 are evidence 
of the fact that partitions based on citation information constitute a 
useful data base for the measure of relatedness and the clustering 
procedure. There were, of course, a few documents which were not in- 
cluded in the cluster to which it appeared they should belong. In 
almost all of these cases it was found that the documents had three or 
fewer citations which was evidently an insufficient number to properly 
place them in their appropriate cluster. 

From this, one might conclude that the clustering system as 
presently programmed may not be an effective retrieval tool for a file 
in which a large fraction of the documents have three or fewer cita- 
tions. Actually what may be needed in such a file is a modification in 
the type or types of partitioning information utilized so that parti- 
tions are also generated by users, title words, authors or some other 
parameter(s). A case where other types of partitionings would have
helped even in the citation-rich T.I.P. file was described in Sec. 9.52.


10.14 Storage Structure 

One general conclusion that was reached in this project is that in
a dynamic system an attempt should be made to give the data a general
structure instead of a structure tailored to one specific requirement.
This will allow a flexible approach to new uses of the data. An
inverted file structure coupled with the raw data file was suggested as
a possible general filing system.

It is argued in Sec. 7.22 that an inverted file should occupy
about the same amount of storage as is occupied by the file which is
being inverted. This claim was verified for the data in the T.I.P.
file.
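
One way to see why the two files should be of similar size is sketched
below. It assumes that each entry of the raw file is a (document, item)
pair and that every such pair reappears exactly once in the inverted
file; this is our reading for illustration only, since the argument of
Sec. 7.22 is not reproduced here, and directory overhead is ignored.
The names and the tiny sample file are hypothetical.

```python
from collections import defaultdict


def invert(raw_file):
    """raw_file maps each document to the items (e.g. citations) it holds.
    The inverted file maps each item to the documents that hold it."""
    inverted = defaultdict(list)
    for doc, items in raw_file.items():
        for item in items:
            inverted[item].append(doc)
    return dict(inverted)


def entry_count(file):
    """Total number of stored (key, value) pairs in a file."""
    return sum(len(values) for values in file.values())


raw_file = {"d1": ["c1", "c2"], "d2": ["c2", "c3"], "d3": ["c2"]}
inverted_file = invert(raw_file)
print(entry_count(raw_file), entry_count(inverted_file))
```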


10.15 Retrieval Language 
The fact that both the syntax and vocabulary of the retrieval
language are table-driven (i.e., they are specified by tables) was
considered to be a significant advantage. As modifications in the
structure of the request and in the words used to describe the request
suggested themselves, they were easily incorporated into the system by
a minor modification in the appropriate table.
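
The idea of a table-driven language can be illustrated with a small
sketch. The two tables, the word classes and the function `parse` are
all assumptions invented for this example; they are not the tables of
the experimental system. The point is only that a new word is admitted
by changing a table and not the program.

```python
# Vocabulary table: word -> function (hypothetical word classes).
VOCABULARY = {
    "get": "VERB", "print": "VERB",
    "titles": "NOUN", "articles": "NOUN",
    "related": "ADJ",
}

# Syntax table: the sequences of word classes that form a valid request.
SYNTAX = {
    ("VERB", "NOUN"),
    ("VERB", "ADJ", "NOUN"),
}


def parse(request, vocabulary=VOCABULARY, syntax=SYNTAX):
    """Look up each word, then check the resulting pattern in the syntax table."""
    words = request.lower().split()
    unknown = [w for w in words if w not in vocabulary]
    if unknown:
        raise ValueError("unknown word(s): " + ", ".join(unknown))
    pattern = tuple(vocabulary[w] for w in words)
    if pattern not in syntax:
        raise ValueError("unrecognized form: " + " ".join(pattern))
    return list(zip(words, pattern))


# Adding a verb requires only one more table entry.
extended = {**VOCABULARY, "list": "VERB"}
parsed = parse("list titles", vocabulary=extended)
print(parsed)
```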

Currently no one besides the author of this thesis has had 
sufficient experience with the retrieval language to evaluate it. Let 
me, therefore, make some admittedly biased observations. 

First, the language was found to be easy to remember even after a
lapse of several months in which it was not used. The language was also
found to have considerable room for future growth. Indeed a large
number of additional verbs and adjectives that would be useful in
retrieval suggested themselves. The ability to make a request for
information as complex or as simple as needed was also found helpful.
Actually only a maximum of about three or four levels of structure has
been utilized so far.


10.2 Evaluation of Procedure 

In this section we shall discuss whether the procedure as described 
in Chapter V has the general characteristics which it needs for opera- 
tion as a retrieval tool. An evaluation of the actual utility of the 
current procedure and experimental system in satisfying user requests 
will be discussed in the next section. 
CONVERGENCE 

Considerable difficulty was encountered with the earlier cluster- 
ing procedures because they occasionally entered into a non-terminating 
cycle. The steps taken to prevent such cycles have been described in 
Sec. 5.53. The experience gained over the past several months supports 
the contention that the current procedure will always converge in a 
finite number of iterations to an answer cluster or to a comment that 
the request is inconsistent. 
GENERAL-SPECIFIC 

From Fig. 9.3 one can conclude that the use of a bias in the 
correlation network does, indeed, allow one to increase or decrease the 
size of the answer cluster. That the value to be given the bias can be 
automatically determined by the composition of the request has been 
experimentally verified by the results of Secs. 9.3-9.5. 
AMBIGUITY RESOLUTION 

In Chapter IX examples are given showing how some of the possible
answer clusters that satisfy a given request can be eliminated by
specifying additional documents to be of interest or not of interest
(additions to the Y and Z sets). It is clear that one can arrive at a
point at which only one cluster satisfies the request by the appropriate
additions to the Y and Z sets. From Fig. 9.7 one might conclude that
on the average at least two members of Z are required to make a request
unambiguous. Of course, even if the request is ambiguous, the desired
answer cluster may still be found. For example, in Sec. 9.31 seven
out of the ten requests with Y=(b_i) resulted in A_1, and yet all seven
are ambiguous.
INCONSISTENCY RECOGNITION 

From the results of Fig. 9.5 we conclude that not only does the
procedure mark as inconsistent those requests for which there is no
answer cluster, but it also decides that some of the requests for
which a valid answer cluster exists are inconsistent. This difficulty
is not considered serious, however, since the user can be coupled into
the system and can guide the procedure in the right direction and
reshape the request if an inconsistent situation is reached.


10.3 Evaluation of System 

In the last section several conclusions were stated concerning the 
characteristics of the clustering procedure. In this section we will 
discuss the more general problem of the effectiveness of the overall 
system as a retrieval tool. 

From Fig. 9.43 we note that the percent of pertinent documents
retrieved by clustering ranges from 63 to 90%. This compares favorably
with a published retrieval efficiency of about 50% for other
automatic retrieval systems.

Almost all of the pertinent documents which were not retrieved
were found to have three or fewer citations. This would give one the
hope that with an expanded data base for the partitions the 63-90%
retrieval efficiency could be improved even more.

We next note from Fig. 9.43 that from 47 to 86% of the retrieved
documents are not part of the set of documents of known pertinence.
Let us assume for a moment that all of these documents are irrelevant.
Many users would still find this acceptable since a quick examination
of the titles could be used to select the articles of interest from
the larger set.

Now let us consider whether or not some of the additional articles 
might really be found to be of interest by a user who has selected the 
cluster in which they are found. 

First, we observe that for the tests of Sec. 9.3 some of the
articles in the clusters were published after the October IEEE
Proceedings came out and thus had no chance of being part of the
bibliographies even if they were pertinent. This is the case, for
example, with the
following documents of Fig. 9.21: de» eg» Kip Kyo: Kj3? Kip Mjor¢e+3 
Migs Mo72 P32 43 > and ds + 

Also the authors of the three bibliographies used probably did not 
intend to exhaustively cover the area. They may have only selected 
what they considered to be the best reference(s) available for each 
specific concept or topic. 

These arguments do not hold for the articles added by the cluster- 
ing procedure to the categories of Sec. 9.4. The categories are 
supposedly exhaustive and should include all but the most recent 
articles. In defense of the additional articles in the clusters let
us give two examples. The first title below is included in the
Physical Review category on "Luminescence" while the second is not.

1-133-1163
Optical properties of cubic SiC, luminescence of nitrogen-exciton
complexes, and interband absorption.

1-133-2023
Optical properties of 15R SiC, luminescence of nitrogen-exciton
complexes, and interband absorption.

As a second example, consider cluster A, of Sec. 9.2. This 
cluster contains three articles that are classified in the category, 
"Erbium", in Physics Abstracts. Of the 31 other articles in the 
cluster three contain the word, "erbium", in their title and seven
more contain the word, "erbium", in the abstract or text. All of the
remaining articles have at least one of the other 14 rare earth elements
mentioned in the title. The following is an example of an article 
contained in the cluster A, but not included in the erbium category. 

1-126-726
Energy levels and crystal-field calculations of Er3+ in
yttrium aluminum garnet.


For the tests with users described in Sec. 9.5 the percentage of 
the cluster that is pertinent would be 27/59 = 46% for User 1 and
37/43 = 86% for User 2 if all of the articles of questionable (or
general) pertinence were counted. The user might even find some of 
those articles judged non-pertinent to be of interest if he were 
allowed to examine the actual article instead of just the title. 

The foregoing arguments and data suggest that a user might, on the
average, find at least half of the documents in a cluster of interest.


It is perhaps significant that the percentage of pertinent
documents retrieved is lower in the tests for the two categories than
for the other tests. The other tests involved bibliographies compiled
by experts (authors and users) while the categories were generated by
indexers.

One might also note that the tests of Sec. 9.3 have higher per- 
centages of pertinent documents retrieved on the whole than do the 
tests of Sec. 9.5. This could be explained by the fact that the users 
of Sec. 9.5 based their decisions on the titles, authors, and citations 
of the articles, while the authors of Sec. 9.3 had undoubtedly read the 
articles they cited. The conclusion to be reached here is that the
clustering procedure tends to do best in those tests where it was
compared to sets generated by the careful consideration of experts.


In conclusion, the experience of this thesis indicates that
clustering may be a useful tool to research workers who desire
information covering either a very specific or a very broad area of
interest. It is our opinion that further development and research are
both warranted and essential.


10.4 Suggestions for Further Research 
The suggestions to be presented here have been divided into
three general categories:
(1) Data base and data structure
(2) Clustering procedure and interaction language
(3) Theoretical problems


10.41 Data Base and Structure 
OTHER DATA BASES 
It has already been suggested (Sec. 10.13) that the clustering
system should be tested on other types of partition data. Some of the
other types of partitions that might be tried are listed in Sec. 6.22.

It is also suggested that tests be made of the simultaneous use of 
several types of partitioning data. In this connection one might 
consider the use of a weighting factor for the partitions which might, 
for example, give a larger weight to partitions generated by citations 
than to those generated by title words. 
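
One possible form of such a weighting is sketched below. It assumes,
for illustration only, that relatedness is computed with the
information-theoretic coefficient attributed to Fano in Appendix A and
that a weight simply replaces the unit count contributed by each
partition. Both the weighting scheme and the sample weights (2 for
citation partitions, 1 for title-word partitions) are hypothetical.

```python
import math


def weighted_relatedness(partitions, i, j):
    """partitions: list of (weight, set_of_documents).
    Weighted analogue of log2(N * N_ij / (N_i * N_j))."""
    n = sum(w for w, _ in partitions)
    n_i = sum(w for w, docs in partitions if i in docs)
    n_j = sum(w for w, docs in partitions if j in docs)
    n_ij = sum(w for w, docs in partitions if i in docs and j in docs)
    if n_ij == 0:
        return float("-inf")  # the pair never occurs together
    return math.log2(n * n_ij / (n_i * n_j))


partitions = [
    (2.0, {"a", "b"}),        # generated by a citation
    (2.0, {"a", "c"}),        # generated by a citation
    (1.0, {"a", "b", "c"}),   # generated by a title word
    (1.0, {"c", "d"}),        # generated by a title word
]
r_ab = weighted_relatedness(partitions, "a", "b")
r_cd = weighted_relatedness(partitions, "c", "d")
print(round(r_ab, 3), round(r_cd, 3))
```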

Of particular interest would be a system which utilized the type 
of usage data described in Chapters II and III. 
CHANGING FILE 

There are a number of questions relating to the fact that a document
collection is continually changing. What should happen when documents
are added to or deleted from the file? Can the user be automatically
notified of new documents of interest? In this connection one might
want the user to permanently store those clusters found to be of
interest. Then as new documents come into the file they can be
compared against the clusters. The user would then be notified of those
articles which were valid members of his clusters.
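
A minimal sketch of such a notification step follows. The test used
here for a "valid member" (the new document cites at least `min_links`
documents already in the stored cluster) is a stand-in chosen only for
illustration; a real system would presumably apply the clustering
procedure itself. All names are hypothetical.

```python
def notify(new_docs, stored_clusters, min_links=2):
    """new_docs: document -> list of documents it cites.
    stored_clusters: user -> set of documents in a saved cluster.
    Returns (user, document) pairs for which a notice should be sent."""
    notices = []
    for doc, cited in new_docs.items():
        for user, cluster in stored_clusters.items():
            if len(set(cited) & set(cluster)) >= min_links:
                notices.append((user, doc))
    return notices


stored_clusters = {"user1": {"a", "b", "c"}, "user2": {"x", "y"}}
new_docs = {"n1": ["a", "b", "q"], "n2": ["x"]}
notices = notify(new_docs, stored_clusters)
print(notices)
```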
CODING 

There is also need for additional work on the problem of data 
coding and compression. For example, one might be able to reduce 
storage requirements considerably by storing codes for all (or certain) 
authors' names in the raw data file. This may be true of the other
types of data also.
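
The kind of coding suggested here can be sketched as a simple
dictionary encoding, in which each distinct name is stored once and the
records hold only small integer codes. The function name and the sample
names are illustrative assumptions.

```python
def encode_authors(records):
    """records: list of author-name lists, one per document.
    Returns (names, encoded), where encoded holds integer codes and
    names[code] recovers the original string."""
    codes = {}
    encoded = []
    for authors in records:
        row = []
        for name in authors:
            # A name seen for the first time receives the next free code.
            row.append(codes.setdefault(name, len(codes)))
        encoded.append(row)
    names = sorted(codes, key=codes.get)
    return names, encoded


records = [["Jones, P. A.", "Smith, J."], ["Jones, P. A."], ["Smith, J."]]
names, encoded = encode_authors(records)
decoded = [[names[code] for code in row] for row in encoded]
print(names, encoded)
```

The saving grows with the number of times a name is repeated in the
file, since each repetition costs one code instead of one full string.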


10.42 Procedure and Language 
There are a number of directions in which the clustering procedure
and interaction language might be extended. One objective might be to
make a wider class of statements acceptable and understandable to the
system. This might involve increasing the vocabulary and/or allowing
other syntactic forms.
PARSING BY CONTEXT 

As a specific suggestion we note that the current system determines 
the function of (parses) a word by a simple table look-up. A word 
cannot have a dual function depending on its context. Thus if one wants 
to use "p" as an abbreviation for print ("p. the titles of set 1"), this
would currently exclude its use, say, as an abbreviation for paper or as
the initial in an author's name (although "get articles by 'P. A. Jones'"
would be acceptable). It should be possible, however, to distinguish
between these different uses if one utilizes the context.
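
A sketch of how context might settle the function of "p" is given
below. The three rules (first word of a request is the verb, a word
following "by" is an author's initial, anything else is the noun
"paper") are assumptions made up for this example, as are the function
names and the third sample request.

```python
def function_of_p(words, k):
    """Classify the k-th word, assumed to be "p" or "p.", by its context."""
    if k == 0:
        return "PRINT"           # a request begins with its verb
    if words[k - 1].lower() == "by":
        return "AUTHOR_INITIAL"  # as in "... by P. A. Jones"
    return "PAPER"


def parse_p(request):
    """Return (position, function) for every occurrence of "p" in a request."""
    words = request.split()
    return [(k, function_of_p(words, k))
            for k, word in enumerate(words)
            if word.lower().rstrip(".") == "p"]


print(parse_p("p. the titles of set 1"))
print(parse_p("get articles by P. A. Jones"))
print(parse_p("get the p. related to set 1"))
```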
GRAPHIC DISPLAY 

A more radical extension of the language would be through the use 
of some type of graphical device. For example, it might prove useful to 
display part of the document network on an oscilloscope and to allow the 
user to specify the interesting and non-interesting documents by means
of a light pen.


In addition to increasing the flexibility of the language, one 
might also want to allow the specification of some other functions. Let 
us suggest some additional functions that the clustering procedure 
might appropriately perform. 

CLUSTER SIZE 

A user might want to limit the size of the answer cluster to some
specified range at the outset (e.g., "Get between 3 and 7 articles
related to Phys. Rev. v. 136 p. 1899."). This could be accomplished by
increasing or decreasing the bias enough so that the size of the answer
cluster fell within the specified range.
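
The adjustment of the bias can be sketched as a simple search. The
clustering procedure of Chapter V is replaced here by a stand-in in
which the cluster is every document whose relatedness to the request
is at least equal to the bias, so that a larger bias gives a smaller
cluster; this stand-in, the step-halving search and all names and
numbers are assumptions for illustration only.

```python
def cluster_for(relatedness, bias):
    """Stand-in for the clustering procedure: keep documents at or above the bias."""
    return {doc for doc, r in relatedness.items() if r >= bias}


def fit_bias(relatedness, low, high, bias=0.0, step=0.5, max_iter=50):
    """Raise or lower the bias until the cluster size lies in [low, high].
    Returns (bias, cluster), or None if no suitable bias is found."""
    last_direction = 0
    for _ in range(max_iter):
        cluster = cluster_for(relatedness, bias)
        if low <= len(cluster) <= high:
            return bias, cluster
        direction = 1 if len(cluster) > high else -1
        if last_direction and direction != last_direction:
            step /= 2.0  # overshot the range: take smaller steps
        bias += direction * step
        last_direction = direction
    return None


# Hypothetical relatedness of ten documents to the requested article.
relatedness = {"d1": 3.0, "d2": 2.5, "d3": 2.0, "d4": 1.5, "d5": 1.0,
               "d6": 0.5, "d7": 0.2, "d8": 0.1, "d9": -0.5, "d10": -1.0}
bias, cluster = fit_bias(relatedness, low=3, high=7)
print(bias, sorted(cluster))
```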
DATA BASE 

It would also be of value to a user if he could specify the type of 
partitioning data to be used by the clustering procedure. Thus the 
command, "Get the articles related by authors and users to Phys. Rev. 
Letters v. 11 p. 6", would use the partitions generated by both authors 
and usage data to create the answer cluster. This control could be 
extended to select for the data base certain classes of partitions 
within a broad type. For example, a request of the type, "Get the 
articles related by M.I.T. faculty users to Phys. Letters v. 7 p. 14", 
would allow the user to single out for use that type of partitioning 
which he thought would yield the best results. 

CLUSTERS OF AUTHORS, ETC. 

There is no real reason why clusters must be limited to sets of 
documents. It may be useful to generalize the system to allow clusters 
to be formed of other types of entities such as authors, locations, 
words, etc. It might be very helpful, for example, to be able to
determine the cluster of scientists that are working in a given field
or area.


10.43 Theoretical Problems 
ANSWER CLUSTER DEFINITION 

Some modification to the definition of an answer cluster may be of 
value. For example, should a change be made to the requirement that all 
the documents specified as interesting be in the cluster? 


NOISE 


There will, of course, be cases where certain documents are
mistakenly included together in a set of interest. This may arise, for
example, from an incorrect judgement on the part of a user or perhaps
by a clerical slip. The effect of this type of noise on the system
should be investigated. Also suitable steps should be taken to maintain
the integrity of the data base through editing processes.
SELF-SUSTAINING RUTS 

Consider an information retrieval system which is based on the 
data generated by its users. This might be one based on usage data or 
on citations. Is it possible in such a system for a self-reinforcing 
feedback loop to be created which cannot be altered? For example, if 
users are supplied documents on the basis of past use, this may create 
new partitions which only serve to reinforce the results of the old 
partitions. 
EVALUATION MEASURE 

The measure described in Chapter III was not suggested for use in 
rating the merit or value of documents. Its function was to group 
together documents that were mutually pertinent. If a suitable way 
could be devised for measuring the worth of documents, this would be of 
considerable aid to users. Perhaps this would take the form of some 
type of consensus of opinion of the previous users of the documents. 
TRAILS VS. SETS 

In the article already cited by V. Bush the model suggested for
information retrieval was a trail leading from one pertinent document
to the next. The model used in this research endeavor is the
partitioning of the file into two subsets. Actually both models have
useful features. In some cases there is a definite pattern or trail
which should be followed in consulting the documents related to a given
subject. In other cases the order in which the documents should be
examined is apparent from their publication dates. In still other cases
there is no particular order in which the documents need be consulted.
Thus it would seem that one might want to include both the ideas of
sets of documents and trails of documents in a more general information
retrieval model.
PREDICTIVE USAGE 

As additional information becomes available on the types of 
questions that are asked by users and the sets of documents that seem 
to satisfy them, it may be possible to design a system involving some 
form of prediction of what a user really wants when he asks a given 
question. This might even be extended to involve trends in document 
usage, so that future document use is extrapolated on the basis of
past use.


APPENDIX A 


MEASURES OF RELATEDNESS 


Some of the measures which have been proposed for use in informa- 
tion retrieval are tabulated below. Measures (1) to (6) were originally 
suggested in terms of frequency counts. Measures (7) and (8) were first 
proposed in terms of probabilities. For purposes of comparison we have 
attempted to express each measure in the table both in terms of 
probabilities and frequency counts. In the case of measure (5) this 
was not possible. 

The definitions for the symbols used in the table and the con- 
version formulae for going from probabilities to frequency counts and 
back again are found in Sec. 3.1. It was necessary to add superscripts
to the frequency counts in the table to distinguish between some
additional counts which appear in these measures. Thus $N_{ij}^{01}$ is
the number of partitions in which the subset of interest contains
document j but not i.
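
For each measure the name, range, form in probabilities and form in
frequency counts are given.

(1) Comparison Function (Martin). Range: 0 to 1/2.

$$C = \frac{p(x_i^1 x_j^1)}{p(x_i^1) + p(x_j^1)} = \frac{N_{ij}^{11}}{N_i^1 + N_j^1}$$

(2) Association Measure (Doyle-1962). Range: 0 to 1.

$$S = \frac{p(x_i^1 x_j^1)}{p(x_i^1) + p(x_j^1) - p(x_i^1 x_j^1)} = \frac{N_{ij}^{11}}{N_i^1 + N_j^1 - N_{ij}^{11}}$$

(3) Coefficient of Colligation (Maron-1960). Range: -1 to +1.

$$\frac{p(x_i^1 x_j^1) - p(x_i^1)\,p(x_j^1)}{p(x_i^1 x_j^1)\,p(x_i^0 x_j^0) + p(x_i^1 x_j^0)\,p(x_i^0 x_j^1)} = \frac{N_{ij}^{11} N_{ij}^{00} - N_{ij}^{10} N_{ij}^{01}}{N_{ij}^{11} N_{ij}^{00} + N_{ij}^{10} N_{ij}^{01}}$$

(4) Pearson Correlation Coefficient (Borko-1962). Range: -1 to +1.

$$\frac{p(x_i^1 x_j^1) - p(x_i^1)\,p(x_j^1)}{\sqrt{p(x_i^1)\,p(x_i^0)\,p(x_j^1)\,p(x_j^0)}} = \frac{N_{ij}^{11} N_{ij}^{00} - N_{ij}^{10} N_{ij}^{01}}{\sqrt{N_i^1 N_i^0 N_j^1 N_j^0}}$$

(5) Chi Square Formula with Yates Correction (Stiles-1961). Range: 0 to $\infty$.
Probabilities: ---

$$\chi^2 = \frac{N \left( \left| N_{ij}^{11} N_{ij}^{00} - N_{ij}^{10} N_{ij}^{01} \right| - \frac{N}{2} \right)^2}{N_i^1 N_i^0 N_j^1 N_j^0}$$

(6) Cosine Function (Salton-1963). Range: 0 to 1.

$$R = \frac{p(x_i^1 x_j^1)}{\sqrt{p(x_i^1)\,p(x_j^1)}} = \frac{N_{ij}^{11}}{\sqrt{N_i^1 N_j^1}}$$

(7) Average Information-Theoretic Correlation Coefficient (Watanabe-1960). Range: 0 to 1.

$$C = \sum_{a,b=0,1} p(x_i^a x_j^b) \log \frac{p(x_i^a x_j^b)}{p(x_i^a)\,p(x_j^b)} = \sum_{a,b=0,1} \frac{N_{ij}^{ab}}{N} \log \frac{N\,N_{ij}^{ab}}{N_i^a N_j^b}$$

(8) Information-Theoretic Correlation Coefficient (Fano-1958). Range: $-\infty$ to $+\infty$.

$$C = \log \frac{p(x_i^1 x_j^1)}{p(x_i^1)\,p(x_j^1)} = \log \frac{N\,N_{ij}^{11}}{N_i^1 N_j^1}$$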
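
For comparison, the measures of Martin, Doyle, Salton and Fano can be
evaluated from the same four frequency counts, as in the sketch below.
It assumes that a partition is represented simply by its subset of
interest, that each document occurs in at least one partition, and that
logarithms are taken to base 2; the function names and the sample
partitions are hypothetical.

```python
import math


def counts(partitions, i, j):
    """Return (N, N_i, N_j, N_ij) for a list of subsets of interest."""
    n = len(partitions)
    n_i = sum(1 for docs in partitions if i in docs)
    n_j = sum(1 for docs in partitions if j in docs)
    n_ij = sum(1 for docs in partitions if i in docs and j in docs)
    return n, n_i, n_j, n_ij


def comparison_function(n, n_i, n_j, n_ij):   # Martin
    return n_ij / (n_i + n_j)


def association_measure(n, n_i, n_j, n_ij):   # Doyle
    return n_ij / (n_i + n_j - n_ij)


def cosine_function(n, n_i, n_j, n_ij):       # Salton
    return n_ij / math.sqrt(n_i * n_j)


def fano_coefficient(n, n_i, n_j, n_ij):      # Fano
    if n_ij == 0:
        return float("-inf")  # the pair never occurs together
    return math.log2(n * n_ij / (n_i * n_j))


partitions = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"c"}]
c = counts(partitions, "a", "b")
print(c)
print(comparison_function(*c), association_measure(*c))
print(cosine_function(*c), fano_coefficient(*c))
```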